RSR2015 OVERVIEW & SPECIFICATIONS

The RSR2015 speaker verification corpus supports the development, training and testing of automatic text-dependent speaker verification systems. Text-dependent speaker verification is attracting growing interest in the industrial and scientific communities. Its potential for higher accuracy makes text-dependent voice biometrics a promising field for remote authentication.

The RSR2015 corpus aims to provide the scientific community with a text-dependent database that supports different types of protocols. For this purpose, it includes utterances of different lengths: short pass-phrases as well as random digit sequences.

RSR2015 involved 300 speakers (157 male, 143 female). Each speaker recorded 3 enrolment sessions and 6 verification sessions of 73 utterances each, for a total of 657 utterances in 9 sessions per speaker. All sessions were recorded using portable devices (handphones and tablets).
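The session and utterance counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this overview; the variable names are illustrative and are not part of the corpus distribution.

```python
# Figures taken from the corpus description; the names are illustrative.
SPEAKERS = 300
ENROLMENT_SESSIONS = 3
VERIFICATION_SESSIONS = 6
UTTERANCES_PER_SESSION = 73

sessions_per_speaker = ENROLMENT_SESSIONS + VERIFICATION_SESSIONS
utterances_per_speaker = sessions_per_speaker * UTTERANCES_PER_SESSION
total_utterances = SPEAKERS * utterances_per_speaker

print(sessions_per_speaker)    # 9
print(utterances_per_speaker)  # 657
print(total_utterances)        # 197100
```

The last value matches the corpus-wide total of 197,100 utterances.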

Corpus speaker distribution
RSR2015 contains a total of 197,100 utterances – 657 utterances spoken by each of the 300 speakers. The distribution of the ethnic origins of the speakers closely resembles that of Singapore’s population. (See Table 1.)

Table 1 - Ethnic origins of speakers

  
           Female   Male   Total
 Chinese     118     119    79%
 Malay        14      28    14%
 Others       11      10     7%

Speakers are aged between 17 and 42; the age distribution is shown in Figure 1.

Figure 1. Speaker distribution according to age.

Hardware

Five portable devices, covering four models, were used for recording:

  • 1 Samsung Nexus
  • 2 Samsung Galaxy S
  • 1 Samsung Tab
  • 1 HTC Desire

Utterances from each speaker were recorded using at least three different devices. (A list of the devices used in each session is provided with the corpus.)

The sample rate for the speech files is 16 kHz, and the sample coding is 16 bits linear. Speech files are stored in raw format.
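Because raw files carry no header, a reader has to supply the format parameters itself. The following is a minimal sketch of loading one such file with the Python standard library. It assumes single-channel audio and little-endian byte order, neither of which is stated in this overview, and the function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 16_000  # from the corpus description


def read_raw_pcm(path, little_endian=True):
    """Read headerless 16-bit linear PCM samples from `path`.

    `little_endian=True` is an assumption; the overview does not
    state the byte order. Mono audio is assumed as well.
    """
    samples = array.array("h")  # signed 16-bit integers
    with open(path, "rb") as f:
        samples.frombytes(f.read())
    # Swap bytes if the file's byte order differs from the host's.
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()
    return samples


def duration_seconds(samples):
    """Duration of a mono sample sequence at the corpus sample rate."""
    return len(samples) / SAMPLE_RATE_HZ
```

If the decoded audio sounds like noise, the byte-order assumption is the first thing to revisit by passing `little_endian=False`.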

Corpus Text Material

The text material in the RSR2015 prompts (found in the file RSR2015_prompts.pdf) comprises four groups of utterances:

Group 1:
30 phonetically balanced sentences taken from the TIMIT corpus. These sentences were selected to provide good coverage of phones. The durations of these sentences after voice activity detection vary from 0.13 s to 2.11 s, with a mean of 0.93 s.

Group 2:
30 short commands used to control home appliances in the StarHome, a fully functional smart-home prototype located at Fusionopolis in Singapore. The durations of these commands after voice activity detection vary from 0.05 s to 1.00 s, with a mean of 0.43 s.

Group 3:
3 series of 10 random digits. Each series contains exactly one occurrence of each digit. These sequences differ between sessions and are shared across speakers to allow for impostor attack scenarios.

Group 4:
10 series of 5 random digits. Each series comprises 5 unique, randomly selected digits (i.e. no digit is repeated within a sequence). These sequences differ between sessions and are shared across speakers to allow for impostor attack scenarios.
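The constraints on the two digit groups can be illustrated with a short sketch. This is not the procedure used to build the corpus prompts, which this overview does not describe. It is a hypothetical generator showing sequences that satisfy the stated properties: a permutation of all ten digits for Group 3, five distinct digits for Group 4, and prompts that depend only on the session so that every speaker sees the same sequences.

```python
import random

DIGITS = list(range(10))


def group3_series(rng):
    """One series of 10 digits in which each digit occurs exactly once."""
    series = DIGITS[:]
    rng.shuffle(series)
    return series


def group4_series(rng):
    """One series of 5 distinct, randomly selected digits."""
    return rng.sample(DIGITS, 5)


def session_digit_prompts(session_seed):
    """Digit prompts for one session (hypothetical helper).

    Seeding by session only, not by speaker, gives all speakers the
    same prompts within a session; different seeds will in general
    give different prompts across sessions.
    """
    rng = random.Random(session_seed)
    return {
        "group3": [group3_series(rng) for _ in range(3)],
        "group4": [group4_series(rng) for _ in range(10)],
    }
```

Together with the 30 sentences of Group 1 and the 30 commands of Group 2, the 3 + 10 digit series account for the 73 utterances in each session.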