6. Gather more multi-lingual data #11

Open
jpc opened this issue Mar 29, 2023 · 3 comments
Labels
goal Main sub-tasks of the project

Comments

@jpc
Contributor

jpc commented Mar 29, 2023

Right now we are using a subset of Libri-Light, a very large (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of speech available in YouTube videos that is probably more expressive and emotional. For the final training run, it would be great to have more varied data to improve the quality of the model.

@jpc jpc added the goal Main sub-tasks of the project label Mar 29, 2023
@faceair

faceair commented Jan 18, 2024

Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech

@jpc
Contributor Author

jpc commented Jan 22, 2024

I think we need native speakers to ensure high-quality material and build the best global open-source TTS system.

I am thinking of setting up a common format and some docs to help people prepare, validate, and upload multilingual speech data to Hugging Face for inclusion in WhisperSpeech base model training.

@jpc jpc changed the title 6. Gather more data 6. Gather more multi-lingual data Jan 22, 2024
@mush42

mush42 commented Jan 30, 2024

Hi

Native Arabic speaker here. Just ping me once you're ready.
