Hi, I have some questions about the training details of LASER. In Appendix A it is stated that:
OpenSubtitles2018: A parallel corpus of movie subtitles in 57 languages. The corpus size varies from a few thousand sentences to more than 50 million. We keep at most 2 million entries for each language pair.
For Chinese and Portuguese, there are separate entries depending on the locale: http://opus.nlpl.eu/OpenSubtitles.php
I'm wondering whether, in this case, you keep 2 million entries for each locale (for a total of 4 million for Chinese and 4 million for Portuguese), or 1 million for each locale (for a total of 2 million per language).
In addition, how are the 2 million sentences sampled? Is it just the first 2 million for each language pair?
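To make the question concrete, here is a minimal sketch of the "first 2 million pairs" reading I have in mind, assuming the usual OPUS plain-text layout of two line-aligned files per language pair; the file names and the truncate-from-the-top strategy are my assumptions, not anything confirmed by the paper:

```python
# Hypothetical sketch: cap an OPUS OpenSubtitles bitext (two line-aligned
# files) at MAX_PAIRS entries by keeping the first pairs in file order.
# This is an assumed procedure for illustration, not LASER's confirmed one.
import itertools

MAX_PAIRS = 2_000_000  # "at most 2 million entries for each language pair"

def cap_bitext(src_path, tgt_path, out_src, out_tgt, max_pairs=MAX_PAIRS):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_src, "w", encoding="utf-8") as osrc, \
         open(out_tgt, "w", encoding="utf-8") as otgt:
        # Iterate source and target lines in lockstep so alignment is kept.
        for s, t in itertools.islice(zip(src, tgt), max_pairs):
            osrc.write(s)
            otgt.write(t)

# Example (hypothetical file names for the en-zh_cn pair):
# cap_bitext("OpenSubtitles.en-zh_cn.en", "OpenSubtitles.en-zh_cn.zh_cn",
#            "capped.en", "capped.zh_cn")
```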
Thank you!