Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[marian] converter supports models from new Tatoeba project #6342

Merged
merged 7 commits into from
Aug 18, 2020

Conversation

sshleifer
Copy link
Contributor

@sshleifer sshleifer commented Aug 8, 2020

  • no state dict change, just need to read model metadata from a new path
  • we only accept models with 7 letter names... like ara-eng. This is the new format.
  • added integration test for ara-eng

Done:

  • upload 300 models: all new ones that were not "dominated" by a model we already have, where dominated means same langpair but the first model has a higher BLEU score.

Todo:

  • switch integration test for ara-eng -> ar-en.
  • automated model cards with correct tags, more info on all possible language codes.
  • automated conflict resolution: Don't convert models that are worse than predecessors.
  • decide what to do about naming: move all to 3 letter/all to 2 letter?
  • notebook -> pyfile
  • tweet

cc @julien-c

Dict where keys are old names, and values are new names, filtered to situations where new name has higher BLEU than old name:

{'bg-es': 'bul-spa',
 'es-eu': 'spa-eus',
 'eu-es': 'eus-spa',
 'es-bg': 'spa-bul',
 'ilo-en': 'ilo-eng',
 'es-mk': 'spa-mkd',
 'es-ca': 'spa-cat',
 'es-af': 'spa-afr',
 'lt-es': 'lit-spa',
 'bn-en': 'ben-eng',
 'th-en': 'tha-eng',
 'fr-ca': 'fra-cat',
 'ga-en': 'gle-eng',
 'en-ga': 'eng-gle',
 'ko-fi': 'kor-fin',
 'es-uk': 'spa-ukr',
 'gl-es': 'glg-spa',
 'eo-sv': 'epo-swe',
 'ca-de': 'cat-deu',
 'az-en': 'aze-eng',
 'sv-eo': 'swe-epo',
 'de-is': 'deu-isl',
 'ceb-en': 'ceb-eng',
 'ca-fr': 'cat-fra',
 'tl-en': 'tgl-eng',
 'is-de': 'isl-deu',
 'ko-en': 'kor-eng',
 'is-es': 'isl-spa',
 'es-gl': 'spa-glg',
 'bg-fr': 'bul-fra',
 'de-af': 'deu-afr',
 'ko-es': 'kor-spa',
 'es-is': 'spa-isl',
 'af-es': 'afr-spa',
 'gl-en': 'glg-eng',
 'fi-en': 'fin-eng',
 'en-bg': 'eng-bul',
 'mk-es': 'mkd-spa',
 'ka-en': 'kat-eng',
 'en-eu': 'eng-eus',
 'de-ca': 'deu-cat',
 'ar-de': 'ara-deu'}
 

@codecov
Copy link

codecov bot commented Aug 8, 2020

Codecov Report

Merging #6342 into master will increase coverage by 0.99%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6342      +/-   ##
==========================================
+ Coverage   78.42%   79.41%   +0.99%     
==========================================
  Files         156      156              
  Lines       28129    28129              
==========================================
+ Hits        22061    22340     +279     
+ Misses       6068     5789     -279     
Impacted Files Coverage Δ
src/transformers/modeling_tf_mobilebert.py 24.55% <0.00%> (-72.36%) ⬇️
src/transformers/generation_tf_utils.py 84.21% <0.00%> (-1.26%) ⬇️
src/transformers/modeling_tf_utils.py 86.31% <0.00%> (-0.98%) ⬇️
src/transformers/modeling_t5.py 83.71% <0.00%> (+0.18%) ⬆️
src/transformers/modeling_utils.py 87.35% <0.00%> (+0.19%) ⬆️
src/transformers/file_utils.py 82.44% <0.00%> (+0.25%) ⬆️
src/transformers/tokenization_utils.py 90.40% <0.00%> (+0.40%) ⬆️
src/transformers/generation_utils.py 96.94% <0.00%> (+0.83%) ⬆️
src/transformers/configuration_utils.py 96.59% <0.00%> (+1.36%) ⬆️
src/transformers/modeling_distilbert.py 97.82% <0.00%> (+1.63%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb7330b...c3288e2. Read the comment docs.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, incredibly simple diff for 300 new models, great stuff!

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too, thanks for adding these!

@sshleifer sshleifer merged commit 12d7624 into huggingface:master Aug 18, 2020
@sshleifer sshleifer deleted the opus-convert-more branch August 18, 2020 03:55
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants