
Can you provide the method to train on our own corpora using your version of fairseq? #8

Open
vishnu3741 opened this issue Dec 23, 2020 · 3 comments

Comments


vishnu3741 commented Dec 23, 2020

I normally use indicnlp to tokenize and Moses to train the MT system, but your model gives better accuracy. Can you give some insight into the amount of corpus used to train the model? Thank you.

@jerinphilip (Owner)

Perhaps the paper linked below will answer the question about the corpus used.

Regarding the data/training:
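
For reference, here is a rough sketch of the standard upstream fairseq recipe for training on your own parallel corpus. It assumes this fork keeps the stock `fairseq-preprocess`/`fairseq-train` CLI; the file paths, language pair, and hyperparameters are placeholders, not the exact settings used for this model.

```python
# Hypothetical sketch: binarize a tokenized parallel corpus and train a
# transformer with stock fairseq, invoked here through subprocess.
# All paths, language codes, and hyperparameters are placeholders.
import subprocess

data_dir = "data-bin/own-corpus"

# 1. Binarize train/valid splits (expects files like corpus/train.en, corpus/train.hi).
subprocess.run([
    "fairseq-preprocess",
    "--source-lang", "en", "--target-lang", "hi",
    "--trainpref", "corpus/train", "--validpref", "corpus/valid",
    "--destdir", data_dir,
    "--workers", "4",
], check=True)

# 2. Train a transformer with commonly used small-NMT settings.
subprocess.run([
    "fairseq-train", data_dir,
    "--arch", "transformer",
    "--optimizer", "adam",
    "--lr", "5e-4", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints/own-corpus",
], check=True)
```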


vishnu3741 commented Jan 7, 2021

Hey, is there a way to add vocabulary (I mean words) to the model instead of retraining the entire model? Can we edit the files in mm-all-iter1 to do this?

@jerinphilip (Owner)

This paper might have some useful information, I think. I'd just retrain with the new vocabulary; the turnaround is approximately 1 day on 4 GPUs to start getting reasonable numbers. This one used 1080 Tis or 2080 Tis.
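
For what it's worth, a minimal sketch of what "editing the files in mm-all-iter1" would involve: the dict.*.txt files are plain "token count" lists that fairseq's Dictionary class can load and extend (the filename below is a placeholder). Note that appending new words only enlarges the dictionary; the checkpoint's embedding matrices are still sized for the old vocabulary, which is why retraining is needed for the new words to mean anything.

```python
# Hypothetical sketch: extend a fairseq dictionary file with new symbols.
# The path is a placeholder; adjust it to the actual dict file in mm-all-iter1.
from fairseq.data import Dictionary

d = Dictionary.load("mm-all-iter1/dict.src.txt")
print("current vocabulary size:", len(d))

# Add new tokens/subwords (each gets a default count of 1).
for word in ["newword1", "newword2"]:
    d.add_symbol(word)

d.save("mm-all-iter1/dict.src.extended.txt")
```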
