For evaluation, we hold out two subsets of the SEAME data as test sets: dev_man, which is dominated by Mandarin speech, and dev_sge, which is dominated by Singapore English. Each test set contains 10 speakers and is gender-balanced.
Set | Speakers | Hours |
---|---|---|
train | 134 | 101.13 |
dev_man | 10 | 7.49 |
dev_sge | 10 | 3.93 |
We share only the wav file list for the train set; the audio files themselves are in LDC2015S04. Please contact me if you have any questions (zengzp0912@gmail.com).
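As a rough illustration of how the train wav file list could be used, the sketch below partitions a set of wav paths into train and held-out files. It assumes the list has one wav path or name per line and that files are matched by base name; both are assumptions, not a description of the actual list format. The function names `load_wav_list` and `split_by_train_list` are hypothetical, and the held-out part is not further divided into dev_man and dev_sge.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def load_wav_list(lines: Iterable[str]) -> Set[str]:
    """Collect wav base names (no directory, no extension) from a file list.

    Assumed format: one wav path or name per line; blank lines and lines
    starting with '#' are skipped.
    """
    names = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(Path(line).stem)
    return names


def split_by_train_list(
    all_wavs: Iterable[str], train_names: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split wav paths into (train, held_out) by base-name membership."""
    train, held_out = [], []
    for wav in all_wavs:
        if Path(wav).stem in train_names:
            train.append(wav)
        else:
            held_out.append(wav)
    return sorted(train), sorted(held_out)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    train_names = load_wav_list(["audio/a01.wav", "b02.wav"])
    corpus = ["x/a01.wav", "x/c03.wav", "y/b02.wav"]
    train, held_out = split_by_train_list(corpus, train_names)
    print(len(train), "train files;", len(held_out), "held-out files")
```

Matching on base names rather than full paths keeps the sketch independent of where the corpus is unpacked locally; if base names are not unique in your copy, match on full relative paths instead.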
[1] Dau-Cheng Lyu, Tien Ping Tan, Eng Siong Chng, and Haizhou Li, “SEAME: a Mandarin-English code-switching speech corpus in South-East Asia,” in INTERSPEECH, 2010, vol. 10, pp. 1986–1989.
[2] Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Eng Siong Chng, and Haizhou Li, “On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition,” arXiv preprint arXiv:1811.00241, 2018.