
How to train a multilingual model, is there a script for it? #1656

Closed
zds-potato opened this issue Jan 10, 2023 · 1 comment

Comments

@zds-potato

I see that the w2v-conformer pre-trained model is trained on a multilingual dataset, but so far I have not found a corresponding multilingual training recipe or script.

One problem I have run into is how to choose the text modeling unit: should it be BPE, char, or something else?

@Emiyassstar
Collaborator

w2v-conformer does not use any text information to compute the pretraining loss. However, to avoid changing the WeNet training pipeline, you can fill the text field with any placeholder unit, e.g. 'A', for the multilingual wavs.
For multilingual training, you can merge all wavs into one dataset and balance the data following Facebook's XLSR model.
For a wav2vec training example, see #1003
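The two points above (placeholder transcripts and XLSR-style balancing) can be sketched as follows. This is only an illustration, not code from the WeNet repo: the `alpha` value, the helper names, and the example paths are assumptions; XLSR balances languages by sampling with probability proportional to n_l ** alpha for alpha < 1, which upsamples low-resource languages.

```python
def balance_weights(lang_counts, alpha=0.5):
    """Per-language sampling probabilities p_l proportional to n_l ** alpha,
    in the spirit of Facebook's XLSR (alpha < 1 upsamples rare languages).
    alpha=0.5 is an illustrative choice, not a WeNet default."""
    scaled = {lang: n ** alpha for lang, n in lang_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}


def make_data_list_line(utt_id, wav_path):
    """One line of a WeNet-style raw data.list, with a dummy transcript 'A',
    since the wav2vec-style pretraining loss ignores the text field."""
    return '{"key": "%s", "wav": "%s", "txt": "A"}' % (utt_id, wav_path)


# Hypothetical per-language utterance counts for the merged dataset.
counts = {"en": 100000, "zh": 50000, "vi": 2000}
weights = balance_weights(counts)
print(weights)
print(make_data_list_line("utt_0001", "/data/en/utt_0001.wav"))
```

With alpha=0.5 the low-resource language ("vi" here) receives a larger share of sampling probability than its raw count would give it, while the weights still sum to 1.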
