6. Gather more multi-lingual data #11

jpc · 2023-03-29T08:20:05Z

Right now we are using a subset of Libri-Light, which is a very big (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.

faceair · 2024-01-18T15:38:43Z

Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech

jpc · 2024-01-22T14:28:28Z

I think we need native speakers to ensure high-quality material and build the best global open-source TTS system.

I am thinking of setting up a common format and some docs to help people prepare, validate, and upload multilingual speech data to Hugging Face for inclusion in WhisperSpeech base model training.

mush42 · 2024-01-30T02:19:21Z

Hi

Native Arabic speaker here. Just ping me once you're ready.

jpc added the "goal" (Main sub-tasks of the project) label Mar 29, 2023

jpc mentioned this issue Mar 29, 2023

7. Train the final models #12

Closed

jpc changed the title ~~6. Gather more data~~ 6. Gather more multi-lingual data Jan 22, 2024
