6. Gather more multi-lingual data #11
Comments
Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech
I think we need native speakers to ensure high-quality material and build the best global open-source TTS system. I am thinking of setting up a common format and some docs to help people prepare, validate, and upload multilingual speech data to Hugging Face for inclusion in WhisperSpeech base model training.
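The common format mentioned above is not yet specified. As a sketch of what the validation step could look like, assume each contribution ships a JSON-lines manifest where every record carries hypothetical fields `audio`, `text`, `language`, and `speaker_id` (these field names are placeholders, not a decided schema):

```python
import json

# Hypothetical required fields for each manifest record (placeholder schema).
REQUIRED_FIELDS = {"audio", "text", "language", "speaker_id"}


def validate_record(record: dict) -> list:
    """Return a list of problems found in a single metadata record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing fields: %s" % sorted(missing))
    # Assumes languages are tagged with 2-letter ISO 639-1 codes.
    if "language" in record and len(record["language"]) != 2:
        problems.append("language should be a 2-letter ISO 639-1 code")
    if "text" in record and not record["text"].strip():
        problems.append("empty transcript")
    return problems


def validate_manifest(lines) -> dict:
    """Validate an iterable of JSON-lines entries; map line index -> problems."""
    report = {}
    for i, line in enumerate(lines):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            report[i] = ["invalid JSON"]
            continue
        problems = validate_record(record)
        if problems:
            report[i] = problems
    return report
```

For example, a manifest with one complete record and one record missing its transcript and language would produce a report flagging only the second line. A check like this could run before upload so that reviewers only see data that already matches the agreed schema.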
Hi, native Arabic speaker here. Just ping me once you're ready.
Right now we are using a subset of Libri-Light, which is a very large (60k-hour) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.