Funnel-Transformer #1156
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! so we can verify it.

What to do if you already signed the CLA:
- Individual signers
- Corporate signers

ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
FunnelBlock and FunnelEncoder from https://arxiv.org/pdf/2006.03236.pdf
… basic unit tests (#2)
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s) and set the cla label.

ℹ️ Googlers: Go here for more info.

@googlebot I consent.
* replaced trivial assert
* Fix style

Co-authored-by: syzymon <s.tworkowski@student.uw.edu.pl>
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.

ℹ️ Googlers: Go here for more info.

@googlebot I fixed it.
Thank you for the pull request, it looks good overall!
I have some important comments about the code, however.
Also, there are some linter errors; Afroz has shared the linter config we use in #1182 (it will be merged later). Can you run the linter with this config and fix the errors?
  )
),
tl.Concatenate(axis=1)
) if separate_cls else pool_layer(pool_size, strides)
I would prefer to see an if-else statement here rather than a conditional expression; this Serial is quite long anyway. Can you split it?
done
def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
I don't think this works as upsampling; short counterexample below. I think it would be reasonable just to assert that the input length is always divisible by pool_size wherever we use pooling; the input length in training/evaluation is usually a power of two anyway.
Counterexample: input_size=31, pool_size=2, single decoding block. AvgPool with default padding ("valid", not "same") will produce a tensor of length 15. After this downsampling block we will have upsampling with factor = 3. This is out of sync with pool_size - we will get tokens mapped to wrong positions in short.repeat.
While this counterexample would be fixed by changing the padding to "same", I have a feeling that we should just assert that the input length at every stage is divisible by pool_size. What do you think?
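To make this concrete, here is a small numpy-only sketch of the counterexample (shapes and values are illustrative; the real code operates on trax tensors):

```python
import numpy as np

# Counterexample from the comment above: length 31, pool_size=2, 'valid' padding.
batch, d_model = 1, 4
long = np.zeros((batch, 31, d_model))    # activations before pooling
short = np.zeros((batch, 15, d_model))   # 31 -> 15 after AvgPool('valid', pool_size=2)

factor = -(-long.shape[1] // short.shape[1])  # ceil(31 / 15) = 3, while pool_size was 2
upsampled = short.repeat(factor, axis=1)[:, :long.shape[1], :]

print(factor)           # 3 -> out of sync with pool_size=2
print(upsampled.shape)  # (1, 31, 4), but token i of `short` now covers positions 3*i..3*i+2
```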
Asserting that the input length is divisible by pool_size would be fine for FunnelTransformer with upsampling, but we use the same pooler for the version without it. We could modify the pooler to take into account whether its output will be upsampled later, but I'm not sure it's worth it.
What could be done fairly easily is to insert into the _Upsample function a check like "if long.shape[1] % short.shape[1] != 0: raise ValueError('message')". Then you don't need to modify anything but the _Upsample function; downsampling isn't touched at all, and we can be sure that upsampling works correctly.
The current issue I have with this _Upsample function is that it computes wrong results (see my counterexample), and I think that throwing an exception is much better than silently returning wrong results.
Adding this check is also much easier than correcting the implementation - this would involve passing a pool_size/stride to _Upsample to be used in place of 'factor', and some padding (see the counterexample in my previous comment). I think that correcting the implementation may not be worth the effort, but adding an assert is worth it.
(Also, while it isn't necessary, I would consider adding this kind of assert even during downsampling. Let's consider the case when we have only downsampling, with pool_size=2, and input length not divisible by 2. Current downsampling simply throws away the last token (due to a padding "valid", which is the default), which is kind of a strange behaviour.)
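A minimal sketch of the suggested check (the error message is illustrative; masks are passed through unchanged here, in line with the separate suggestion below to keep the original masks):

```python
import jax.numpy as jnp

def _Upsample(short, masks, long):
  """Upsamples `short` back to the length of `long` and adds them; keeps `masks` as-is."""
  if long.shape[1] % short.shape[1] != 0:
    raise ValueError(f'Long length {long.shape[1]} is not a multiple of '
                     f'short length {short.shape[1]}.')
  factor = long.shape[1] // short.shape[1]  # exact after the check above, no ceil needed
  new_vecs = long + jnp.repeat(short, factor, axis=1)
  return new_vecs, masks
```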
def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
  new_vecs = long + short.repeat(factor, axis=1)[:, :long.shape[1], :]
  new_masks = masks.repeat(factor, axis=-1)[:, :, :, :long.shape[1]]
I think masks shouldn't be upsampled like this - we could just use the original masks instead of downsampling and upsampling them, which may introduce errors.
done
                 pool_layer, pool_size, strides, separate_cls):
  """Internal funnel block. On input it takes (activations, masks).

  Args:
Can you include the arguments "pool_layer" and "separate_cls" in the description?
done
tl.Parallel(
    None,
    tl.Fn('mask_max_pool',
          _InternalMaxPool),
This pool layer used for the masks doesn't have pool_size/strides synchronized with the pool layer for attention. So, as I understand it, setting pool_size/strides different from 2 will not work.
Can you change this pool layer so that it adjusts the mask according to pool_size?
(This comment applies also to _FunnelResidualBlock.)
Indeed, the pool_size/strides in this mask pool weren't synchronized with PoolLayer. I fixed this by replacing _InternalMaxPool with a new MaskPool layer, which uses PoolLayer internally, so it can now handle pool_size and strides different from 2.
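For illustration, the synchronization fix can be expressed in plain jax.numpy like this (a sketch only; the actual MaskPool in the PR wraps PoolLayer, and the 4-D mask shape (batch, 1, 1, length) is an assumption about the attention-mask layout):

```python
import jax.numpy as jnp

def _pool_mask(mask, pool_size, strides):
  """Max-pools a boolean attention mask of shape (batch, 1, 1, length) with the
  same pool_size/strides as the activations: a pooled position stays valid if
  any position in its window was valid."""
  length = mask.shape[-1]
  n_windows = (length - pool_size) // strides + 1  # 'valid'-style windowing
  windows = jnp.stack(
      [mask[..., i * strides:i * strides + pool_size] for i in range(n_windows)],
      axis=-2)                       # (batch, 1, 1, n_windows, pool_size)
  return jnp.max(windows, axis=-1)   # (batch, 1, 1, n_windows)
```

For pool_size == strides == 2 this presumably matches what _InternalMaxPool did; for other values the mask now shrinks in step with the activations, which is the point of the MaskPool replacement.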
)

def FunnelTransformerEncoder(vocab_size,
Can you add documentation to these models, along with argument descriptions like in _FunnelBlock? This also applies to _FunnelResidualBlock and FunnelTransformer.
The docstrings have been added.
feed_forward = _FeedForwardBlock(
    d_model, d_ff, dropout, dropout_shared_axes, mode, ff_activation)

dropout_ = tl.Dropout(
Why do the variables have a "_" suffix here?
Fixed. The dropout_ layer variable name was taken from the original TransformerEncoder to avoid shadowing the dropout rate argument; I replaced it with the more meaningful hidden_dropout.
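For reference, the rename amounts to something like this (a sketch; tl is trax.layers and the argument names follow the surrounding diff):

```python
from trax import layers as tl

def _hidden_dropout_layer(dropout, dropout_shared_axes, mode):
  # `dropout` is the rate argument from the model constructor; calling the layer
  # `hidden_dropout` (instead of `dropout_`) avoids shadowing it.
  hidden_dropout = tl.Dropout(
      rate=dropout, shared_axes=dropout_shared_axes, mode=mode)
  return hidden_dropout
```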
pooling_ = PoolLayer(pool_layer, pool_size, strides)

return [
    tl.Parallel(tl.Branch(pooling_, None), None),
It's not very important, but can this be replaced by Select + pooling? I think it will be clearer.
This Parallel actually looks like a no-op so I removed it, but I would prefer not to apply pooling inside the residual with attention (Select is used there to split into Q, K, V).
]

def FunnelTransformer(vocab_size,
Can we rename it to FunnelTransformerDecoder, to keep naming consistent with models/transformer.py? It seems closer to TransformerDecoder than Transformer, since the former outputs an embedding per token (like this Funnel class) and the latter predicts a class per token.
As per our previous discussion, we changed FunnelTransformer to output a token-level categorical distribution over the vocabulary instead of embeddings, which makes it useful, for example, as a BERT-style model.
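In trax terms, that change roughly amounts to ending the model with a projection to the vocabulary followed by a log-softmax (a sketch, not the exact layer stack from the PR):

```python
from trax import layers as tl

def _VocabHead(vocab_size):
  """Maps per-token embeddings to log-probabilities over the vocabulary."""
  return tl.Serial(
      tl.Dense(vocab_size),  # (batch, length, d_model) -> (batch, length, vocab_size)
      tl.LogSoftmax(),       # token-level categorical distribution
  )
```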
FunnelTransformer

class FunnelTransformerTest(parameterized.TestCase):
Can we change all tests to use, approximately, the smallest possible models, e.g. d_model=8, d_ff=8, n_layers=2, etc.? This would speed up the tests very significantly while testing the same functionality.
I have changed these parameters in the tests; they now run much faster.
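A sketch of what such a minimal test can look like (the constructor keywords d_model/d_ff follow the diff above, while n_classes and the import path are assumptions):

```python
import numpy as np
from absl.testing import parameterized
from trax import shapes
# The import path below is an assumption about where the PR adds the model.
from trax.models.research.funnel_transformer import FunnelTransformerEncoder

class FunnelTransformerEncoderTest(parameterized.TestCase):

  def test_forward_shape_tiny_model(self):
    # Tiny, fast configuration along the lines suggested above.
    model = FunnelTransformerEncoder(vocab_size=16, d_model=8, d_ff=8, n_classes=2)
    x = np.ones((2, 8), dtype=np.int32)   # (batch, length) of token ids
    model.init(shapes.signature(x))
    y = model(x)
    self.assertEqual(y.shape[0], 2)       # one output row per batch element
```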
* Split PoolLayer
* Adjust model sizes in unit tests to make them faster
* Replace InternalMaxPool with more generic MaskPool
* Cls token test
* Fix formatting
* Fix formatting #2
* Cls pooling fix
* Remove unnecessary unpacking
* Remove unused strides parameter
* keep original masks instead of upsampling
* fix typo
* Rename variables in funnel residual block
* Add docs
* Fix dropout shadow variable
* Funnel residual block refactor
* FunnelTransformer output change, make residual FunnelBlock default + more docs
* Docstring fix
Thanks! I only wanted to follow up on the implementation of _Upsample. Everything else looks good to me.
def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
* new upsampler
* fix comments
* missing endline
* Fix pep8
* Add unit test for upsampling
* review fixes
* Reformat

Co-authored-by: syzymon <s.tworkowski@student.uw.edu.pl>
Implementation of https://arxiv.org/abs/2006.03236