
[fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test #7224

Merged
merged 18 commits into huggingface:master from fsmt5 on Sep 21, 2020

Conversation

@stas00 stas00 (Contributor) commented Sep 17, 2020

These changes are bundled into one PR because they all fix problems that show up with USE_CUDA=1:

  • fix a few USE_CUDA=1 tests that previously got skipped (a missing .to(device))
  • rewrite SinusoidalPositionalEmbedding as a normal nn.Embedding subclass with a regular self.weight param, but exclude that param from the saved state_dict, since it is deterministic rather than trained
  • adjust PreTrainedModel.save_pretrained to support models that don't want all of their params saved (needed for fsmt's SinusoidalPositionalEmbedding)
  • add a new test using TranslationPipeline (well, this one is just a new test)

@sshleifer

This includes fixing: #7229
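
For readers skimming the diff, a minimal sketch of the idea, assuming an illustrative `_make_weight` helper (the real class lives in modeling_fsmt.py and may differ in details):

    import math
    import torch
    import torch.nn as nn

    class SinusoidalPositionalEmbedding(nn.Embedding):
        """Positional embedding whose weight is deterministic (sinusoidal),
        so it is never trained and never needs to be checkpointed. Sketch only."""

        def __init__(self, num_positions, embedding_dim, padding_idx=None):
            super().__init__(num_positions, embedding_dim, padding_idx=padding_idx)
            self.weight = self._make_weight(num_positions, embedding_dim, padding_idx)

        @staticmethod
        def _make_weight(num_positions, embedding_dim, padding_idx):
            # standard sinusoidal table (assumes an even embedding_dim for brevity):
            # sin on even feature indices, cos on odd ones
            position = torch.arange(num_positions, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(
                torch.arange(0, embedding_dim, 2, dtype=torch.float)
                * (-math.log(10000.0) / embedding_dim)
            )
            weight = torch.zeros(num_positions, embedding_dim)
            weight[:, 0::2] = torch.sin(position * div_term)
            weight[:, 1::2] = torch.cos(position * div_term)
            if padding_idx is not None:
                weight[padding_idx] = 0.0
            # deterministic, so no gradient and no reason to store it in the checkpoint
            return nn.Parameter(weight, requires_grad=False)

Once the weight is fully deterministic, keeping it out of the checkpoint is just a matter of save_pretrained skipping that key, which is what the PreTrainedModel.save_pretrained change above enables.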

@codecov codecov bot commented Sep 19, 2020

Codecov Report

Merging #7224 into master will decrease coverage by 0.24%.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7224      +/-   ##
==========================================
- Coverage   81.81%   81.57%   -0.25%     
==========================================
  Files         174      174              
  Lines       33446    33448       +2     
==========================================
- Hits        27364    27285      -79     
- Misses       6082     6163      +81     
Impacted Files Coverage Δ
src/transformers/modeling_utils.py 86.96% <83.33%> (-0.27%) ⬇️
src/transformers/modeling_fsmt.py 93.99% <100.00%> (+0.40%) ⬆️
src/transformers/modeling_tf_electra.py 25.32% <0.00%> (-73.63%) ⬇️
src/transformers/modeling_tf_roberta.py 45.63% <0.00%> (-47.62%) ⬇️
src/transformers/modeling_tf_ctrl.py 83.74% <0.00%> (-14.14%) ⬇️
src/transformers/modeling_gpt2.py 79.03% <0.00%> (-7.80%) ⬇️
src/transformers/generation_tf_utils.py 82.95% <0.00%> (-3.51%) ⬇️
src/transformers/modeling_ctrl.py 95.16% <0.00%> (-2.42%) ⬇️
src/transformers/modeling_tf_utils.py 86.03% <0.00%> (-1.30%) ⬇️
...rc/transformers/data/datasets/language_modeling.py 92.94% <0.00%> (-1.18%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d90d0f...1b80514.

@stas00 stas00 changed the title [wip] fsmt tests improvements [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes Sep 19, 2020
@stas00 stas00 changed the title [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes [fsmt] rewrite SinusoidalPositionalEmbedding + USE_CUDA test fixes + new TranslationPipeline test Sep 19, 2020
@sshleifer sshleifer (Contributor) left a comment

This PR is wonderful! parameterized.expand is a great find, and not saving static embeddings is an obvious win. We should add the latter to bart in a separate PR.

The expansion of embeddings may require a bit more care, but the comment below doesn't prevent merging this PR. You can just delete that logic later if it is bad.

Expanding Positional Embeddings

if max_pos > self.weight.size(0): # recompute/expand embeddings if needed

The reason I haven't auto-expanded the bart positional embeddings so far is that I wanted long sequences the model would translate poorly to raise an error, instead of just silently degrading. But if you can, say, concatenate a few en-ru examples and verify that performance doesn't plummet, that would be good. There is also a theoretical O(seq_len^2) cost associated with passing longer documents through transformers, so we may not want to encourage longer docs; instead we could write tooling that, for example, uses moses SentenceSplitter to chunk documents, passes them through the model, and rejoins the results correctly. If it just works with the auto-expansion hack then I'm all for it.
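
For context, a minimal sketch of where that auto-expansion sits in the forward pass, continuing the SinusoidalPositionalEmbedding sketch from the PR description above (fairseq-style position numbering; names are illustrative):

        def forward(self, input_ids):
            bsz, seq_len = input_ids.shape
            max_pos = self.padding_idx + 1 + seq_len
            if max_pos > self.weight.size(0):
                # input is longer than the current table covers: rebuild a larger
                # deterministic table on the fly instead of erroring out
                self.weight = self._make_weight(max_pos, self.embedding_dim, self.padding_idx)
            # positions count up from padding_idx + 1 and skip pad tokens (fairseq convention)
            mask = input_ids.ne(self.padding_idx).int()
            positions = (torch.cumsum(mask, dim=1).type_as(mask) * mask).long() + self.padding_idx
            return super().forward(positions)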

@stas00 stas00 (Contributor, Author) commented Sep 19, 2020

The expansion of embeddings may require a bit more care, but the comment below doesn't prevent merging this PR. You can just delete that logic later if it is bad.

This is how fairseq does it - this PR doesn't change the original behavior, so yes, it can be merged, as it fixes the USE_CUDA=1 with torchscript situation. Further algorithmic changes will require separate care. I moved the issue you raised to its own ticket: #7256

So let's continue discussing that suggestion there. Thank you for bringing it up, @sshleifer!

@sshleifer sshleifer (Contributor) commented
Great, I will let the 2nd approving reviewer merge.

@stas00 stas00 (Contributor, Author) commented Sep 19, 2020

Oh, your comment made me discover a bug in my port - somehow I had used the vocab sizes as the number of positional embeddings - so it's not surprising the weights were 250MB. Fixed now.

I rechecked - fairseq inits them to config.max_position_embeddings + self.padding_idx + 1 - so we have 2 extra entries there.

Or should I sync it with bart and just have config.max_position_embeddings, so it's consistent across transformers?

context:

        self.embed_positions = SinusoidalPositionalEmbedding(
            config.max_position_embeddings + self.padding_idx + 1, embed_dim, self.padding_idx
        )

@sshleifer?
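
(For illustration, with assumed values rather than ones read from an actual checkpoint config: max_position_embeddings = 1024 and padding_idx = 1 would give 1024 + 1 + 1 = 1026 embedding rows, i.e. two more than bart's plain config.max_position_embeddings.)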

@stas00 stas00 (Contributor, Author) commented Sep 19, 2020

I split off the key naming discussion to #7258 - this is again not a show-stopper for this PR, as it impacts modeling_bart.py too.

@sgugger sgugger (Collaborator) left a comment

This is the most elegant solution to the problem we can have, short of PyTorch supporting a new kind of weights. Thanks a lot for your work on this @stas00 !

@LysandreJik LysandreJik (Member) left a comment

Very nice change; it would have saved us a lot of pain with the addition of position_ids in several models. In my opinion, we should add these keys (the position_ids buffers) to the keys that are never saved.

@LysandreJik LysandreJik (Member) commented
Or, as mentioned here: #6700 (comment), change the persistent attribute to False for the registered buffers.
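
For reference, a minimal sketch of that suggestion, assuming an illustrative ToyModel and noting that the persistent kwarg requires PyTorch >= 1.6:

    import torch
    import torch.nn as nn

    class ToyModel(nn.Module):
        def __init__(self, max_positions=512):
            super().__init__()
            # persistent=False keeps the buffer out of state_dict(), so it is
            # neither written to the checkpoint nor expected when loading one
            self.register_buffer(
                "position_ids", torch.arange(max_positions).unsqueeze(0), persistent=False
            )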

@stas00 stas00 (Contributor, Author) commented Sep 21, 2020

Or, as mentioned here: #6700 (comment), change the persistent attribute to False for the registered buffers.

That won't work at the moment, since:

  1. this functionality was added only a few months ago (we can't require such a recent torch)
  2. it doesn't yet work with torchscript

@LysandreJik LysandreJik (Member) commented
Indeed, thanks for clarifying!

@LysandreJik LysandreJik merged commit 8ff88d2 into huggingface:master Sep 21, 2020
@stas00 stas00 deleted the fsmt5 branch September 21, 2020 15:42