
Can you provide the method to train on our own corpora using your version of fairseq? #8

Open
vishnu3741 opened this issue Dec 23, 2020 · 3 comments

Comments


vishnu3741 commented Dec 23, 2020

I normally use indicnlp to tokenize and Moses to train the MT system, but your model gives better accuracy. Can you give some insight into the amount of corpus used to train the model? Thank you.

@jerinphilip (Owner)

Perhaps the paper linked below will answer the question about the corpus used.

Regarding the data/training:
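
For reference, here is a rough sketch of the standard upstream fairseq recipe for training on your own parallel corpus. It assumes this fork keeps the stock `fairseq-preprocess`/`fairseq-train` CLI; the file paths, language pair, and hyperparameters are placeholders, not the exact settings used for this model.

```python
# Hypothetical sketch: binarize a tokenized parallel corpus and train a
# transformer with stock fairseq, invoked here through subprocess.
# All paths, language codes, and hyperparameters are placeholders.
import subprocess

data_dir = "data-bin/own-corpus"

# 1. Binarize train/valid splits (expects files like corpus/train.en, corpus/train.hi).
subprocess.run([
    "fairseq-preprocess",
    "--source-lang", "en", "--target-lang", "hi",
    "--trainpref", "corpus/train", "--validpref", "corpus/valid",
    "--destdir", data_dir,
    "--workers", "4",
], check=True)

# 2. Train a transformer with commonly used small-NMT settings.
subprocess.run([
    "fairseq-train", data_dir,
    "--arch", "transformer",
    "--optimizer", "adam",
    "--lr", "5e-4", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints/own-corpus",
], check=True)
```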


vishnu3741 commented Jan 7, 2021

Hey, is there a way to add vocabulary (I mean words) to the model instead of retraining the entire model? Can we edit the files in mm-all-iter1 to do this?

@jerinphilip (Owner)

This paper might have some useful information, I think. I'd just retrain with the new vocabulary; the turnaround is approximately 1 day on 4 GPUs to start getting reasonable numbers. This one used 1080 Tis or 2080 Tis.
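
For what it's worth, a minimal sketch of what "editing the files in mm-all-iter1" would involve: the dict.*.txt files are plain "token count" lists that fairseq's Dictionary class can load and extend (the filename below is a placeholder). Note that appending new words only enlarges the dictionary; the checkpoint's embedding matrices are still sized for the old vocabulary, which is why retraining is needed for the new words to mean anything.

```python
# Hypothetical sketch: extend a fairseq dictionary file with new symbols.
# The path is a placeholder; adjust it to the actual dict file in mm-all-iter1.
from fairseq.data import Dictionary

d = Dictionary.load("mm-all-iter1/dict.src.txt")
print("current vocabulary size:", len(d))

# Add new tokens/subwords (each gets a default count of 1).
for word in ["newword1", "newword2"]:
    d.add_symbol(word)

d.save("mm-all-iter1/dict.src.extended.txt")
```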
