
Funnel-Transformer #1156

Merged
merged 9 commits into from
Nov 20, 2020

Conversation

mtyrolski
Contributor

@google-cla

google-cla bot commented Oct 27, 2020

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added the cla: no label Oct 27, 2020
@mtyrolski
Contributor Author

@googlebot I signed it!

@google-cla google-cla bot added cla: yes and removed cla: no labels Oct 28, 2020
@google-cla

google-cla bot commented Oct 28, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no and removed cla: yes labels Oct 28, 2020
@mtyrolski
Contributor Author

@googlebot I consent.

@google-cla

google-cla bot commented Oct 28, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@syzymon
Contributor

syzymon commented Oct 28, 2020

@googlebot I consent.

@google-cla

google-cla bot commented Oct 28, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@shadowatyyy
Contributor

@googlebot I consent.

@google-cla google-cla bot added cla: yes and removed cla: no labels Oct 28, 2020
@afrozenator afrozenator added the ready to pull label (Added when the PR is ready to be merged.) Nov 2, 2020
* replaced trivial assert

* Fix style

Co-authored-by: syzymon <s.tworkowski@student.uw.edu.pl>
@google-cla

google-cla bot commented Nov 5, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no and removed cla: yes labels Nov 5, 2020
@syzymon
Contributor

syzymon commented Nov 5, 2020

@googlebot I fixed it.

@google-cla google-cla bot added cla: yes and removed cla: no labels Nov 5, 2020
Collaborator

@sebastianjaszczur sebastianjaszczur left a comment

Thank you for the pull request; it looks good overall!
I have some important comments about the code, however.

Also, there are some linter errors; Afroz has shared the linter config we use in #1182 (it will be merged later). Can you run the linter with this config and fix the errors?

)
),
tl.Concatenate(axis=1)
) if separate_cls else pool_layer(pool_size, strides)
Collaborator

I would prefer an if-else statement here rather than a conditional expression; this Serial is quite long anyway. Can you split it?

Contributor

done



def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
Collaborator

I don't think this works as upsampling; short counterexample below. I think it would be reasonable to just assert that the input length is always divisible by pool_size wherever we use pooling (the input length in training/evaluation is usually a power of two anyway).

Counterexample: input_size=31, pool_size=2, single decoding block. AvgPool with the default padding ("valid", not "same") will produce a tensor of length 15. After this downsampling block we will have upsampling with factor = 3. This is out of sync with pool_size - we will get tokens mapped to the wrong positions by short.repeat.
While this counterexample would be fixed by changing the padding to "same", I feel we should just assert that the input length at every stage is divisible by pool_size. What do you think?
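
A small NumPy sketch of this counterexample (illustrative only; integers stand in for the pooled vectors):

import numpy as np

long_len, pool_size = 31, 2
short_len = (long_len - pool_size) // pool_size + 1  # 15 after AvgPool('valid')
short = np.arange(short_len)                # pooled token i is just the value i
factor = -(-long_len // short_len)          # ceil division -> 3, not pool_size=2
upsampled = short.repeat(factor)[:long_len]
print(factor)         # 3
print(upsampled[:8])  # [0 0 0 1 1 1 2 2]: pooled token 1 lands at position 3,
                      # although it was produced from original positions 2-3.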

Contributor

Asserting that the input length is divisible by pool_size would be fine for the FunnelTransformer with upsampling, but we use the same pooler for the version without it. We could modify the pooler to take into account whether its output will be upsampled later, but I'm not sure it's worth it.

Collaborator

What could be done fairly easily is to insert into the _Upsample function a check like "if long.shape[1] % short.shape[1] != 0: raise ValueError('message')". Then you don't need to modify anything but the _Upsample function; downsampling isn't touched at all, and we can be sure that upsampling works correctly.
The current issue I have with this _Upsample function is that it computes wrong results (see my counterexample), and I think that throwing an exception is much better than silently returning wrong results.

Adding this check is also much easier than correcting the implementation - that would involve passing a pool_size/stride to _Upsample to be used in place of 'factor', plus some padding (see the counterexample in my previous comment). I think that correcting the implementation may not be worth the effort, but adding an assert is.

(Also, while it isn't necessary, I would consider adding this kind of assert during downsampling too. Consider the case where we have only downsampling, with pool_size=2, and an input length not divisible by 2. The current downsampling simply throws away the last token (due to the "valid" padding, which is the default), which is rather strange behaviour.)
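
A minimal sketch of the suggested guard, mirroring the _Upsample snippet above (illustrative only, written against plain NumPy-style arrays; not the code merged in this PR):

def _Upsample(short, masks, long):
  # Fail loudly when the lengths are out of sync; the repeat-based upsampling
  # is only correct when the long length is an exact multiple of the short one.
  if long.shape[1] % short.shape[1] != 0:
    raise ValueError(
        'long length %d is not divisible by short length %d'
        % (long.shape[1], short.shape[1]))
  factor = long.shape[1] // short.shape[1]  # exact division, no ceil needed
  new_vecs = long + short.repeat(factor, axis=1)[:, :long.shape[1], :]
  new_masks = masks.repeat(factor, axis=-1)[:, :, :, :long.shape[1]]
  return new_vecs, new_masks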

def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
  new_vecs = long + short.repeat(factor, axis=1)[:, :long.shape[1], :]
  new_masks = masks.repeat(factor, axis=-1)[:, :, :, :long.shape[1]]
Collaborator

I think the masks shouldn't be upsampled like this - we could just use the original masks instead of downsampling and upsampling them, which may introduce errors.

Contributor

done

pool_layer, pool_size, strides, separate_cls):
"""Internal funnel block. On input it takes (activations, masks).

Args:
Collaborator

Can you include the arguments "pool_layer" and "separate_cls" in the description?

Contributor

done

tl.Parallel(
    None,
    tl.Fn('mask_max_pool',
          _InternalMaxPool),
Collaborator

This pool layer used for the masks doesn't have its pool_size/strides synchronized with the pool layer for attention, so, as I understand it, setting pool_size/strides different from 2 will not work.
Can you change this pool layer so that it adjusts the mask according to pool_size?
(This comment also applies to _FunnelResidualBlock.)

Contributor

Indeed, the pool_size/strides in this mask pool weren't synchronized with PoolLayer. I fixed this by replacing _InternalMaxPool with a new MaskPool layer, which uses PoolLayer internally, so it can now use pool_size and strides other than 2.
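
An illustrative NumPy sketch of this idea (not the PR's actual MaskPool, which wraps PoolLayer): the mask is pooled with the same pool_size/strides as the activations, and a pooled position stays valid if any position in its window was valid:

import numpy as np

def mask_pool(mask, pool_size=2, strides=2):
  # Max-pool a boolean mask of shape [batch, length] with 'valid' padding.
  n_windows = (mask.shape[1] - pool_size) // strides + 1
  windows = np.stack(
      [mask[:, i * strides:i * strides + pool_size] for i in range(n_windows)],
      axis=1)
  return windows.max(axis=-1)

mask = np.array([[1, 1, 1, 1, 1, 0, 0, 0]], dtype=bool)
print(mask_pool(mask, pool_size=2, strides=2))  # [[ True  True  True False]]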

)


def FunnelTransformerEncoder(vocab_size,
Collaborator

Can you add documentation to these models, along with argument descriptions like in _FunnelBlock? This also applies to _FunnelResidualBlock and FunnelTransformer.

Contributor

The docstrings have been added.

feed_forward = _FeedForwardBlock(
    d_model, d_ff, dropout, dropout_shared_axes, mode, ff_activation)

dropout_ = tl.Dropout(
Collaborator

Why do the variables have a "_" suffix here?

Contributor

Fixed - the dropout_ variable name was taken from the original TransformerEncoder to avoid shadowing the dropout rate argument; it has been replaced with the more meaningful hidden_dropout.

pooling_ = PoolLayer(pool_layer, pool_size, strides)

return [
    tl.Parallel(tl.Branch(pooling_, None), None),
Collaborator

It's not very important, but can this be replaced by Select + pooling? I think it would be clearer.

Contributor

This Parallel actually looks like a no-op, so I removed it; but I would prefer not to apply pooling inside the residual with attention (Select is used there to split into Q, K, V).

]


def FunnelTransformer(vocab_size,
Collaborator

Can we rename it to FunnelTransformerDecoder, to keep the naming consistent with models/transformer.py? It seems closer to TransformerDecoder than to Transformer, since the former outputs an embedding per token (like this Funnel class) and the latter predicts a class per token.

Contributor

As per our previous discussion, we changed FunnelTransformer to output a token-level categorical distribution over the vocabulary instead of embeddings, which makes it useful, for example, as a BERT-style model.



class FunnelTransformerTest(parameterized.TestCase):
Collaborator

Can we change all tests to use approximately the smallest possible models, e.g. d_model=8, d_ff=8, n_layers=2, etc.? This would speed up the tests very significantly while testing the same functionality.

Contributor

I have changed these parameters in the tests; they now run much faster.

* Split PoolLayer

* Adjust model sizes in unit tests to make them faster

* Replace InternalMaxPool with more generic MaskPool

* Cls token test

* Fix formatting

* Fix formatting #2

* Cls pooling fix

* Remove unnecessary unpacking

* Remove unused strides parameter
@afrozenator afrozenator removed the ready to pull label (Added when the PR is ready to be merged.) Nov 9, 2020
shadowatyyy and others added 2 commits November 11, 2020 21:22
* keep original masks instead of upsampling

* fix typo
* Rename variables in funnel residual block

* Add docs

* Fix dropout shadow variable

* Funnel residual block refactor

* FunnelTransformer output change, make residual FunnelBlock default + more docs

* Docstring fix
Collaborator

@sebastianjaszczur sebastianjaszczur left a comment

Thanks! I only wanted to follow up on the implementation of _Upsample. Everything else looks good to me.



def _Upsample(short, masks, long):
  factor = -(-long.shape[1] // short.shape[1])  # ceil division
Collaborator

What could be done fairly easily is to insert into the _Upsample function a check like "if long.shape[1] % short.shape[1] != 0: raise ValueError('message')". Then you don't need to modify anything but the _Upsample function; downsampling isn't touched at all, and we can be sure that upsampling works correctly.
The current issue I have with this _Upsample function is that it computes wrong results (see my counterexample), and I think that throwing an exception is much better than silently returning wrong results.

Adding this check is also much easier than correcting the implementation - that would involve passing a pool_size/stride to _Upsample to be used in place of 'factor', plus some padding (see the counterexample in my previous comment). I think that correcting the implementation may not be worth the effort, but adding an assert is.

(Also, while it isn't necessary, I would consider adding this kind of assert during downsampling too. Consider the case where we have only downsampling, with pool_size=2, and an input length not divisible by 2. The current downsampling simply throws away the last token (due to the "valid" padding, which is the default), which is rather strange behaviour.)

* new upsampler

* fix comments

* missing endline

* Fix pep8

* Add unit test for upsampling

* review fixes

* Reformat

Co-authored-by: syzymon <s.tworkowski@student.uw.edu.pl>
@sebastianjaszczur sebastianjaszczur added the ready to pull label (Added when the PR is ready to be merged.) Nov 19, 2020
@copybara-service copybara-service bot merged commit d10f3a0 into google:master Nov 20, 2020