Obtaining the CHAINS corpus by FTP

The corpus has been packed into tar archive files, which were then compressed and split into chunks of no more than 100 Mb using the 7zip archive tool. 7zip is free software distributed under the GNU LGPL. The 7zip file archiver is available for a number of platforms (e.g. Windows/OS X/Linux) from http://www.7-zip.org/download.html.

To obtain the uncompressed tar archive, download each part of the relevant speaking condition into the same folder and proceed with the decompression. 7zip will automatically recognize the multiple volume archive and concatenate the outputs into a single tar file.

In a Unix-like environment, decompression is as simple as typing:

7za e ./<condition>.tar.7z.001

Each tar file contains the recordings under the directory "data/". The directory "doc/" contains the corpus documentation in various formats.
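
The two steps above (decompressing the multi-volume archive, then unpacking the resulting tar file) can be combined in a small shell function. This is only a sketch: the function name unpack_condition is hypothetical, and it assumes that 7za and tar are on the PATH and that every part of the chosen condition has been downloaded into the current directory.

```shell
# Hypothetical helper: decompress one speaking condition and unpack its tar file.
# Assumes 7za and tar are installed and all parts of the condition
# (e.g. solo.tar.7z.001 ... solo.tar.7z.005) are in the current directory.
unpack_condition() {
    cond="$1"
    # Refuse to start if the first volume is not present.
    if [ ! -f "./${cond}.tar.7z.001" ]; then
        echo "missing ./${cond}.tar.7z.001" >&2
        return 1
    fi
    # Step 1: rebuild <condition>.tar from the multi-volume 7zip archive.
    7za e "./${cond}.tar.7z.001" || return 1
    # Step 2: unpack the tar file, which holds the data/ and doc/ directories.
    tar -xf "./${cond}.tar"
}
```

For example, unpack_condition solo would be expected to leave solo.tar together with the unpacked "data/" and "doc/" directories in the current folder.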

Note: Matlab code for extracting pykfec coefficients is available. See below for details. ARFF format files containing MFCCs for the 8, 16 and 36 speaker databases have now been made available. The links to these are at the bottom of the page.

Corpus Files

Solo Condition
solo.tar.7z.001 (100 Mb)
solo.tar.7z.002 (100 Mb)
solo.tar.7z.003 (100 Mb)
solo.tar.7z.004 (100 Mb)
solo.tar.7z.005 (75 Mb)
Synchronous Condition
sync.tar.7z.001 (100 Mb)
sync.tar.7z.002 (100 Mb)
sync.tar.7z.003 (100 Mb)
sync.tar.7z.004 (100 Mb)
sync.tar.7z.005 (100 Mb)
sync.tar.7z.006 (68 Mb)
Retelling Condition
retell.tar.7z.001 (71 Mb)
Repetitive Synchronous Imitation (stereo files: speaker and target)
rsi.tar.7z.001 (100 Mb)
rsi.tar.7z.002 (100 Mb)
rsi.tar.7z.003 (56 Mb)
Fast Speech
fast.tar.7z.001 (100 Mb)
fast.tar.7z.002 (100 Mb)
fast.tar.7z.003 (100 Mb)
fast.tar.7z.004 (16 Mb)
Whispered Speech
whsp.tar.7z.001 (100 Mb)
whsp.tar.7z.002 (100 Mb)
whsp.tar.7z.003 (100 Mb)
whsp.tar.7z.004 (100 Mb)
whsp.tar.7z.005 (37 Mb)
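
Because decompression needs every part of a condition to be present, it can help to check the download before running 7zip. The sketch below is an assumption-laden illustration, not part of the corpus distribution: check_parts is a hypothetical helper, and the part counts passed to it are simply read off the list above (solo 5, sync 6, retell 1, rsi 3, fast 4, whsp 5).

```shell
# Hypothetical helper: check that parts 001..N of a condition are all present
# in the current directory. Prints each missing part and returns non-zero
# if at least one part is absent.
check_parts() {
    cond="$1"
    count="$2"
    missing=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Part names follow the pattern <condition>.tar.7z.NNN
        part=$(printf '%s.tar.7z.%03d' "$cond" "$i")
        if [ ! -f "$part" ]; then
            echo "missing: $part"
            missing=1
        fi
        i=$((i + 1))
    done
    return "$missing"
}

# Example use (not run here): decompress only when all five solo parts are present.
#   check_parts solo 5 && 7za e ./solo.tar.7z.001
```

This check only confirms that the files exist; it does not verify their sizes or contents.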

The tar gzipped files contain datasets extracted from a selection of speakers and speech data from the corpus.

Pykfec extraction

The CHAINS corpus formed the basis for work in Speaker Identification using Instantaneous Frequencies. This work is described in Grimaldi, M. and Cummins, F. (2008). Speaker Identification Using Instantaneous Frequencies. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1097-1111. The features used are referred to as pykfec.

We provide here a reference implementation for pykfec extraction in the form of a Matlab/Octave toolbox. Please read the file README for basic instructions on how to use the software. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

The software is available here


ARFF datasets

The datasets are in WEKA ARFF format (http://sourceforge.net/projects/weka/) and contain speech encoded using 25 MFCCs (the coefficients are extracted using a reference implementation that can be downloaded from http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/). The header of each dataset contains information about the speech data and speakers used, and brief instructions on how to use the MFCC reference implementation to reproduce the values.
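
Since the header carries the dataset description, it can be useful to read it without loading the whole file into WEKA. In ARFF, the header is everything before the @data line, and comment lines start with "%". A minimal sketch follows; arff_header is a hypothetical helper name, and the file name in the usage comment is a placeholder, not the name of an actual file in the download.

```shell
# Hypothetical helper: print the header of an ARFF file, i.e. every line
# before the @data declaration (ARFF keywords are case-insensitive).
arff_header() {
    awk 'tolower($1) == "@data" { exit } { print }' "$1"
}

# Example use (placeholder file name):
#   arff_header some_dataset.arff | less
```

Stopping at @data keeps the output short even for the larger speaker sets, because the numeric MFCC rows are never printed.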

8 speaker set (18 Mb)
16 speaker set (35 Mb)
36 speaker set (59 Mb)

If you have problems downloading these, you can try the following HTTP links instead:

8 speaker set (18 Mb)
16 speaker set (35 Mb)
36 speaker set (59 Mb)