[fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test #7224
Conversation
This was needed for torchscript to work - it is now part of the state_dict, so these keys will have to be removed during save_pretrained.
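A minimal sketch of that idea, assuming PyTorch and an illustrative `keys_to_never_save` attribute (not necessarily the name used in the actual change):

```python
import torch

def save_filtered_state_dict(model, path):
    # Drop deterministic keys (e.g. sinusoidal position embedding weights)
    # before writing the checkpoint; they can be recomputed at load time.
    # `keys_to_never_save` is an illustrative attribute name.
    state_dict = model.state_dict()
    for key in getattr(model, "keys_to_never_save", []) or []:
        state_dict.pop(key, None)
    torch.save(state_dict, path)
```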
Codecov Report
@@ Coverage Diff @@
## master #7224 +/- ##
==========================================
- Coverage 81.81% 81.57% -0.25%
==========================================
Files 174 174
Lines 33446 33448 +2
==========================================
- Hits 27364 27285 -79
- Misses 6082 6163 +81
Continue to review full report at Codecov.
This PR is wonderful! `parameterized.expand` is a great find, and not saving static embeddings is an obvious win. We should add the latter to bart in a separate PR.
The expansion of embeddings may require a bit more care, but the comment below doesn't prevent merging this PR. You can just delete that logic later if it is bad.
Expanding Positional Embeddings
if max_pos > self.weight.size(0): # recompute/expand embeddings if needed
The reason I haven't auto-expanded the bart positional embeddings so far is that I wanted an error for long sequences that the model would translate poorly, rather than just silently poor performance. But if you can, say, concatenate a few en-ru examples and check that performance doesn't plummet, that would be good. There is also a theoretical O(seq_len^2) cost associated with passing longer documents through transformers, so we may not want to encourage longer docs and instead write tooling that, for example, uses moses SentenceSplitter to chunk documents, pass them through the model, and rejoin the results correctly. If it just works with the auto-expansion hack then I'm all for it.
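For reference, a hedged sketch of the auto-expansion hack quoted above; `build_table` is an assumed helper that rebuilds the sinusoidal table, not the actual fsmt code:

```python
import torch
import torch.nn as nn

def expand_positional_embeddings(emb: nn.Embedding, max_pos: int, build_table):
    """Grow a deterministic positional-embedding table on demand.

    `build_table(n, dim)` is an assumed callable returning a (n, dim) sinusoidal
    table; the real fsmt implementation may differ.
    """
    if max_pos > emb.weight.size(0):  # recompute/expand embeddings if needed
        emb.weight.data = build_table(max_pos, emb.embedding_dim).to(emb.weight.device)
        emb.num_embeddings = max_pos
```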
Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
This is how fairseq does it - this PR doesn't change the original behavior, so yes, it can be merged, as it fixes the USE_CUDA=1 + torchscript situation. Further algorithmic changes will require separate care. I moved the issue you raised to its own ticket: #7256 - so let's continue discussing that suggestion there. Thank you for bringing it up, @sshleifer!
Great, I will let the 2nd approving reviewer merge.
Oh, your comment made me discover a bug in my porting - somehow I used the vocab sizes as the number of positional embeddings, so it's not surprising the weights were 250MB - fixed now. I rechecked - fairseq inits them to the [...]. Or should I sync it with bart and just have [...]? For context:
I split off the key naming discussion to #7258 - this is again not a show-stopper for this PR, as it impacts [...]
This is the most elegant solution to the problem we can have, short of PyTorch supporting a new kind of weights. Thanks a lot for your work on this @stas00 !
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Very nice change, would have saved us a lot of pain with the addition of position_ids in several models. We should add these keys (the position IDs buffers) to the keys to never save, in my opinion. Or, as mentioned here: #6700 (comment), change the [...]
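A hedged sketch of that suggestion: register `position_ids` as a buffer and list it among keys that are never saved. The `keys_to_never_save` attribute name and the toy module are assumptions for illustration, not the real transformers API:

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    # Illustrative: a position_ids buffer never needs to be serialized,
    # since it is trivially re-created in __init__ at load time.
    keys_to_never_save = ["position_ids"]  # assumed attribute name

    def __init__(self, max_positions: int = 512, dim: int = 16):
        super().__init__()
        self.register_buffer("position_ids", torch.arange(max_positions).unsqueeze(0))
        self.embed = nn.Embedding(max_positions, dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.size(1)
        return self.embed(self.position_ids[:, :seq_len])
```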
It won't work at the moment, since [...]
Indeed, thanks for clarifying!
These changes are in one PR as they all fix problems for `USE_CUDA=1`:

- `USE_CUDA=1` test fixes: enable tests that got skipped previously (was missing `.to(device)`)
- rewrite `SinusoidalPositionalEmbedding` to be a normal `nn.Embedding` subclass with a normal `self.weight` param, but exclude this param from being saved with the `state_dict`, since it's not trained but deterministic (see the sketch after this list)
- `PreTrainedModel.save_pretrained`: support models that don't want all of their params saved (needed for fsmt's `SinusoidalPositionalEmbedding`)
- new `TranslationPipeline` test (well, this one is just a new test)

@sshleifer
This includes fixing: #7229
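For illustration, here is a minimal sketch of the approach described in the bullets above: a sinusoidal embedding implemented as an `nn.Embedding` subclass whose deterministic `self.weight` is filtered out at save time. The class name, `_build_table`, and `save_without_keys` are made up for this sketch and are not the actual fsmt or `save_pretrained` code.

```python
# Hedged sketch, not the actual fsmt implementation. Assumes an even embedding_dim.
import math
import torch
import torch.nn as nn

class SketchSinusoidalPositionalEmbedding(nn.Embedding):
    """A deterministic sin/cos table kept in a regular self.weight param."""

    def __init__(self, num_positions: int, embedding_dim: int):
        super().__init__(num_positions, embedding_dim)
        self.weight.data = self._build_table(num_positions, embedding_dim)
        self.weight.requires_grad = False  # never trained, recomputed on load

    @staticmethod
    def _build_table(num_positions: int, dim: int) -> torch.Tensor:
        position = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
        inv_freq = torch.exp(
            torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim)
        )
        table = torch.zeros(num_positions, dim)
        table[:, 0::2] = torch.sin(position * inv_freq)
        table[:, 1::2] = torch.cos(position * inv_freq)
        return table

def save_without_keys(model: nn.Module, path: str, skip_keys):
    """Mimics the save_pretrained change: drop deterministic keys before saving."""
    state_dict = {k: v for k, v in model.state_dict().items() if k not in set(skip_keys)}
    torch.save(state_dict, path)
```

Loading such a checkpoint then requires either `load_state_dict(..., strict=False)` or re-materializing the table in `__init__`, since the embedding weight is intentionally absent from the saved file.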