
High Quality EN-DE/EN-FR Translators #5419

Closed
sshleifer opened this issue Jun 30, 2020 · 8 comments · Fixed by #6940
Labels: Help wanted, New model, translation, machine translation, utilities and models

Comments


sshleifer commented Jun 30, 2020

Download instructions from torch.hub/fairseq: here.
The BART conversion script should be reusable.
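
For reference, here is a minimal sketch of pulling one of the fairseq WMT19 ensembles via torch.hub (the en-de direction is shown; the checkpoint and option names below are the ones fairseq documents, not something specified in this issue):

import torch

# Download the WMT19 en-de ensemble on first use; drop everything but
# 'model1.pt' from checkpoint_file to load a single model instead of four.
en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de',
    checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
    tokenizer='moses',
    bpe='fastbpe',
)
print(en2de.translate('Machine learning is great!'))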

Open source status

  • [x] the model implementation is available: (give details)
  • [x] the model weights are available: (give details)
  • [x] who are the authors: (mention them, if possible by @gh-username)
    Sergey Edunov, Myle Ott (@myleott), Michael Auli, David Grangier

Paper: https://arxiv.org/pdf/1808.09381.pdf

Spec

Desired API:

mname = 'facebook/wmt-en-de'
model = FairseqTranslator.from_pretrained(mname)
tokenizer = FairseqBPETokenizer.from_pretrained(mname)  # AutoTokenizer should also work
batch = tokenizer.prepare_seq2seq_batch(['Maschinelles Lernen ist großartig!'])
translated = model.generate(**batch)  # determine
assert tokenizer.batch_decode(translated)[0] == 'Machine Learning is great'
  • Add .rst docs (see the adding-a-new-model instructions, but don't follow them too religiously if something seems suboptimal).
  • Check timing and memory vs. fairseq (see the timing sketch after this list).
  • If lots of modeling code is added, the common tests should pass.
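
A rough way to compare generation speed and peak GPU memory between the two implementations (a sketch only; hf_model, fairseq_model, batch, and sentences are placeholders for whatever you loaded, not a prescribed API):

import time
import torch

def benchmark(generate_fn, n_runs=10):
    # Average wall-clock time per call and peak GPU memory in MB.
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(n_runs):
        generate_fn()
    elapsed = (time.time() - start) / n_runs
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return elapsed, peak_mb

# Example usage with the two loaded models:
# print(benchmark(lambda: hf_model.generate(**batch)))
# print(benchmark(lambda: fairseq_model.translate(sentences)))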

Steps

  1. Get tokenizer equivalence (the fairseq object should have an encode method, and there should be wget-able links from fairseq for the relevant tokenizer files).
    1b. Upload the tokenizer to S3 so your tokenizer tests work on CI. You can work out of the stas/fairseq-en-de namespace on your modelhub account and then move everything over (or not) at the end.
  2. Get model.forward / "logits" equivalence (ignore differences smaller than 1e-6; see the equivalence-check sketch after this list). This usually doesn't work the first time, and you have to go line by line with two ipdb sessions (one fairseq, one hf) until you find the line that differs. At this stage, worry very little about code quality and just try to get the integration tests passing.
  3. Get model.generate / "translation" equivalence. There may be small beam search discrepancies. For this you will need to figure out decoder_start_token_id, num_beams, and other config settings.
  4. Upload everything to S3.
  5. Go through the template and make sure most of the reasonable things are done. At this point a full integration test (as above) should pass.
  6. Check memory, time, and BLEU against fairseq (ideally in Colab). Improve/document the results in the PR description.
  7. Test the scary parts: special tokens, padding insensitivity.
  8. Docs/AutoConfig etc.
    Helpful: https://huggingface.co/transformers/model_sharing.html
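
For step 2, the comparison usually boils down to feeding the same token IDs through both models and diffing the outputs; a minimal sketch, with hf_logits/fairseq_logits standing in for the two forward-pass results (not a prescribed API):

import torch

def assert_logits_close(hf_logits, fairseq_logits, atol=1e-6):
    # Both tensors are expected to have shape (batch, seq_len, vocab_size).
    assert hf_logits.shape == fairseq_logits.shape, (hf_logits.shape, fairseq_logits.shape)
    max_diff = (hf_logits - fairseq_logits).abs().max().item()
    assert max_diff <= atol, f"max logits difference {max_diff} exceeds {atol}"

# Usage: run one batch through each implementation under torch.no_grad(),
# pull out the decoder output logits, and call assert_logits_close on them.
# When it fails, that is where the two ipdb sessions come in.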

Assigned to: @stas00


cp-pc commented Jul 4, 2020

Excuse me. Will this model be added in the future, and how long will it take?
Are T5 and BART currently the only models that can do machine translation?

sshleifer (Contributor, Author) commented

I would guess that I get around to this by the end of July, but I can't be sure.

We also have MarianMTModel and 1000+ pretrained weights from Helsinki-NLP/ that do translation. Here is the list:
https://huggingface.co/Helsinki-NLP
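
For reference, a short sketch of translating with one of those Marian checkpoints (standard transformers usage; the opus-mt-en-de checkpoint name is just one example from the Helsinki-NLP list):

from transformers import MarianMTModel, MarianTokenizer

mname = 'Helsinki-NLP/opus-mt-en-de'
tokenizer = MarianTokenizer.from_pretrained(mname)
model = MarianMTModel.from_pretrained(mname)

batch = tokenizer(['Machine learning is great!'], return_tensors='pt')
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))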


stas00 commented Aug 15, 2020

I will work on this one.


stas00 commented Aug 18, 2020

Here is a lazy man's implementation that uses a simple proxy to the fairseq implementation and makes the spec test pass:

import torch


class FairseqProxy:
    def __init__(self, module):
        self.module = module

    @classmethod
    def from_pretrained(cls, mname):
        return cls(module=torch.hub.load(
            'pytorch/fairseq', mname,
            checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
            tokenizer='moses', bpe='fastbpe'))


class FairseqTranslator(FairseqProxy):

    def generate(self, **tokenized_sentences):
        return self.module.generate(tokenized_sentences['data'])


class FairseqBPETokenizer(FairseqProxy):

    def prepare_seq2seq_batch(self, sentences):  # encode
        return {'data': [self.module.encode(sentence) for sentence in sentences]}

    def batch_decode(self, batched_hypos):
        return [self.module.decode(hypos[0]['tokens']) for hypos in batched_hypos]


# Look ma, I cheated and the test passes ;)
mname = 'transformer.wmt19.ru-en'
model = FairseqTranslator.from_pretrained(mname)
tokenizer = FairseqBPETokenizer.from_pretrained(mname)
batch = tokenizer.prepare_seq2seq_batch(["Машинное обучение - это здорово!"])
translated = model.generate(**batch)
assert tokenizer.batch_decode(translated)[0] == 'Machine learning is great!'

Now to the real work of porting...


stas00 commented Sep 4, 2020

mostly done: #6940
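
Once that PR lands, usage should mirror the spec at the top of this issue; a sketch assuming the class and checkpoint names the port introduces (FSMTForConditionalGeneration, FSMTTokenizer, facebook/wmt19-de-en); treat these names as assumptions until the PR is merged:

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = 'facebook/wmt19-de-en'  # assumed checkpoint name from the port
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

batch = tokenizer(['Maschinelles Lernen ist großartig!'], return_tensors='pt')
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))  # expected: 'Machine learning is great!'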

sshleifer linked a pull request Sep 4, 2020 that will close this issue

stas00 commented Sep 15, 2020

Once #6940 is merged, this issue is to be closed.

sshleifer (Contributor, Author) commented

FYI, linked pull requests automatically close the linked issue.


stas00 commented Sep 15, 2020

I noticed that you had already done the linking only after leaving my comment, but decided to leave it up, since my previous comment wasn't certain ;)
