TOURSG: A RESEARCH DIALOGUE CORPUS IN THE TOURISTIC DOMAIN

TourSG consists of dialogue sessions on touristic information for Singapore. It was collected from Skype calls between actual tour guides and tourists. Two collections are available: one in English (EN-TourSG) and one in Chinese (ZH-TourSG). EN-TourSG comprises 35 dialogue sessions and ZH-TourSG comprises 36 dialogue sessions; each collection totals 21 hours of conversation.

All dialogue sessions have been manually transcribed and annotated with speech act and semantic labels at the turn level. Annotations are also available at the sub-dialogue segment level: each full dialogue session is divided into topically coherent sub-dialogues, and each sub-dialogue is assigned a major topic category and annotated with a frame structure of slot-value pairs that represents the subject discussed within it.
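As a rough illustration of the two annotation levels described above, the following Python sketch models a session as a list of sub-dialogues, each holding a topic, a slot-value frame, and its annotated turns. All class names, field names, and label values here (Turn, SubDialogue, "ATTRACTION", "PLACE", "QST", and so on) are hypothetical and chosen for illustration only; they are not the corpus's actual file format or label inventory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One speaker turn with turn-level annotations (hypothetical layout)."""
    speaker: str                                             # e.g. "guide" or "tourist"
    transcript: str                                          # manual transcription
    speech_acts: List[str] = field(default_factory=list)     # turn-level speech act labels
    semantic_tags: List[str] = field(default_factory=list)   # turn-level semantic labels


@dataclass
class SubDialogue:
    """A topically coherent segment of a session (hypothetical layout)."""
    topic: str                                                   # major topic category
    frame: Dict[str, List[str]] = field(default_factory=dict)   # slot -> values
    turns: List[Turn] = field(default_factory=list)


def slot_values(session: List[SubDialogue], slot: str) -> List[str]:
    """Collect every value annotated for `slot` across a session's sub-dialogues."""
    values: List[str] = []
    for segment in session:
        values.extend(segment.frame.get(slot, []))
    return values


# Invented example data; not taken from the corpus.
session = [
    SubDialogue(
        topic="ATTRACTION",
        frame={"PLACE": ["Example Garden"]},
        turns=[Turn("tourist", "What can I see there?", ["QST"])],
    ),
    SubDialogue(
        topic="FOOD",
        frame={"PLACE": ["Example Food Centre"], "DISH": ["noodles"]},
        turns=[Turn("guide", "You could try the noodles.", ["RECOMMEND"])],
    ),
]

places = slot_values(session, "PLACE")
```

Mapping each slot to a list of values is a design assumption that lets a single sub-dialogue mention several values for the same slot.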
EN-TourSG and ZH-TourSG have been used as evaluation data for the Fourth and Fifth Dialogue State Tracking Challenges (DSTC4 [1] and DSTC5 [2]).
Basic statistics of the datasets:
Language   Dialogues   Utterances   Words / Characters    Total Duration
English    35          31,034       273,580 words         21 hours
Chinese    36          54,464       492,711 characters    21 hours
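
The figures above can be turned into per-dialogue and per-utterance averages with a few lines of Python. This is only a convenience sketch: the counts are copied from the statistics in this document, while the dictionary layout and the function name are assumptions made for illustration. English length is counted in words and Chinese length in characters, so the two length averages are not directly comparable.

```python
# Counts copied from the basic statistics; the layout is hypothetical.
stats = {
    "English": {"dialogues": 35, "utterances": 31034, "units": 273580, "unit": "words"},
    "Chinese": {"dialogues": 36, "utterances": 54464, "units": 492711, "unit": "characters"},
}


def averages(row):
    """Return (utterances per dialogue, length units per utterance) for one language."""
    utterances_per_dialogue = row["utterances"] / row["dialogues"]
    units_per_utterance = row["units"] / row["utterances"]
    return utterances_per_dialogue, units_per_utterance


for language, row in stats.items():
    per_dialogue, per_utterance = averages(row)
    print(f"{language}: {per_dialogue:.1f} utterances/dialogue, "
          f"{per_utterance:.1f} {row['unit']}/utterance")
```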

[1] Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, Matthew Henderson. The Fourth Dialog State Tracking Challenge. Proceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS 2016), Saariselkä, Jan 2016.
[2] Seokhwan Kim, Luis Fernando D'Haro, Rafael E. Banchs, Jason D. Williams, Matthew Henderson, Koichiro Yoshino. The Fifth Dialog State Tracking Challenge. Proceedings of the 2016 IEEE Spoken Language Technology Workshop (SLT 2016), San Diego, Dec 2016.