[WIP] Safe bpe dropout for LM + joiner disjoin dropout #2009

Open · wants to merge 5 commits into base: master
Conversation

@funboarder13920 (Collaborator) commented Feb 17, 2021

  • On LM tasks: copy src to tgt in the tokenizer transforms; otherwise behavior is hazardous when tokenizers have random behaviors such as dropout
  • disable dropout in tokenizers during validation
  • implement a "disjoin joiner with dropout" transform to make inference possible at any point in the sentence (a minimal sketch of the idea follows below)
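A minimal sketch of the disjoin idea (a hypothetical helper, not the exact code in this PR): with some probability, a joiner attached to a token is split off into a standalone joiner token.

import random

JOINER = "￭"  # joiner marker (OpenNMT's default; rendered as "■" elsewhere in this thread)

def disjoin_joiner_dropout(tokens, dropout=0.1):
    """With probability `dropout`, detach a leading joiner from its token,
    so the model also sees standalone joiner tokens during training."""
    out = []
    for tok in tokens:
        if tok != JOINER and tok.startswith(JOINER) and random.random() < dropout:
            out.append(JOINER)             # standalone joiner token
            out.append(tok[len(JOINER):])  # the token without its joiner
        else:
            out.append(tok)
    return out

# e.g. ["plan", "￭etarium"] may become ["plan", "￭", "etarium"]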

else:
    src_out = self.dropout_separate_joiner(example["src"], "src")
example["src"] = src_out
if self.opts.model_task == ModelTask.LANGUAGE_MODEL:
@funboarder13920 (Collaborator, Author) commented Feb 17, 2021
Currently this won't work in build_vocab: model_task is a model parameter, not a dynamic corpus parameter.
Same issue in tokenizers.py.

@francoishernandez (Member) commented Feb 17, 2021

is_train should be False anyway in build_vocab, no?
EDIT: Never mind, I mixed it up with build_vocab_only.

if elem == SubwordMarker.JOINER:
    continue
if elem.startswith(SubwordMarker.JOINER):
    if random.random() < dropout:
@funboarder13920 (Collaborator, Author) commented Feb 17, 2021

Not sure about the necessity of handling both right and left token sides.
It might make it difficult to retrieve the initial token when detokenizing.
I can use a special token to distinguish left from right, or drop the left-side disjoin, which doesn't occur much (mainly on punctuation).

Comment on lines 351 to 354
    kwopts['bpe_dropout'] = subword_alpha if is_train else 0
elif subword_type == 'sentencepiece':
    kwopts['sp_model_path'] = subword_model
    kwopts['sp_nbest_size'] = subword_nbest
Contributor

You can directly reassign subword_alpha & subword_nbest to disable both bpe_dropout and sp sampling.
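A minimal sketch of that suggestion, assuming the surrounding context and reusing the names from the snippet above:

# Neutralize the sampling parameters once, instead of guarding each kwopts entry with is_train.
if not is_train:
    subword_alpha = 0   # 0 disables BPE dropout
    subword_nbest = 1   # 1 disables SentencePiece n-best sampling

if subword_type == 'bpe':
    kwopts['bpe_dropout'] = subword_alpha
elif subword_type == 'sentencepiece':
    kwopts['sp_model_path'] = subword_model
    kwopts['sp_nbest_size'] = subword_nbest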

Comment on lines -368 to -372
_diff_vocab = (
    src_subword_kwargs.get('vocabulary_path', '') !=
    tgt_subword_kwargs.get('vocabulary_path', '') or
    src_subword_kwargs.get('vocabulary_threshold', 0) !=
    tgt_subword_kwargs.get('vocabulary_threshold', 0))
Contributor

I personally prefer the current one, which seems more readable.

@Zenglinxiao (Contributor) left a comment

Could you elaborate a bit on the interest of this joiner disjoin dropout mechanism?
If we drop standalone joiners, we may have trouble recovering the original sentence when detokenizing some sequences. Also, randomly splitting left/right joiners from tokens will increase sequence length and cause more tokens to fall into <unk>.
Is there a good reason to apply this despite all these potential limitations/conflicts? What is the objective of this proposal?

@francoishernandez (Member)
The idea is to allow LM generation from incomplete words, without explicitly knowing that the word is incomplete.
E.g.
Prefix: "We can go to another planet"
generation 1: "We can go to another planet like Mars."
generation 2: "We can go to another planetarium since this one is closed."
In the case of generation 2, we need a joiner at some point. In most cases, joiners are attached to the end of a subword, but here we can't know beforehand that we will need one. Hence, the model would learn by itself to place a standalone joiner when needed.

@funboarder13920 (Collaborator, Author) commented Feb 17, 2021

Could you elaborate a bit on the interest of this joiner disjoin dropout mechanism?
If we drop standalone joiners, we may have trouble recovering the original sentence when detokenizing some sequences. Also, randomly splitting left/right joiners from tokens will increase sequence length and cause more tokens to fall into <unk>.
Is there a good reason to apply this despite all these potential limitations/conflicts? What is the objective of this proposal?

The goal is to make inference possible at any point in the sentence, even in the middle of a word, without having to handle that at translation/generation time.
Do you have an example of a sequence that cannot be decoded with this mechanism? I've tried a few, and the joiner marker will join words even when it is not attached to any token.
Regarding the increase in sequence length, it's the same issue as with bpe dropout. We chose to introduce randomness to limit the increase in sequence length rather than using joiner_new.
Some tokens might fall into <unk>, but this will happen with rare tokens that might not be in the vocabulary anyway; the number of encountered <unk> increases with bpe_dropout as well as with joiner disjoin dropout. I chose to cut the BPE construction at fairly high-frequency merges, so there should not be many <unk>: the most frequent tokens will be seen when building the vocabulary. For example, we go from a vocab size of 43704 with BPE only to 46621 with BPE dropout and joiner disjoin dropout. I think we got most of the tokens that can appear, but I can still check the number of <unk> seen. Moreover, the randomness in these <unk> and the amount of data used for the task might make <unk> a non-issue.

One issue might be that <unk> is considered in the loss function, and this could increase its likelihood.

@Zenglinxiao (Contributor)
Do you have an example of a sequence that cannot be decoded with this mechanism? I've tried a few, and the joiner marker will join words even when it is not attached to any token.

Yes, the joiner marker will join words whether or not it is attached to a token. Actually, I'm talking about lines 85-86 in the method dropout_separate_joiner, where standalone joiners are simply ignored.
Ex: "word ■ ," --detok--> "word,"; but "word ," --detok--> "word ,".
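For illustration, a toy detokenizer (not the actual Tokenizer implementation) that honors standalone joiners reproduces these mappings:

def toy_detokenize(tokens, joiner="￭"):
    """Toy detokenizer: any joiner glues the adjacent tokens together."""
    out = ""
    glue_next = False
    for tok in tokens:
        if tok == joiner:          # standalone joiner: glue previous and next tokens
            glue_next = True
            continue
        core = tok.strip(joiner)
        if out and not (glue_next or tok.startswith(joiner)):
            out += " "
        out += core
        glue_next = tok.endswith(joiner)
    return out

print(toy_detokenize(["word", "￭,"]))      # -> "word,"
print(toy_detokenize(["word", "￭", ","]))  # -> "word,"
print(toy_detokenize(["word", ","]))       # -> "word ,"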

@funboarder13920 (Collaborator, Author) commented Feb 17, 2021

It is a mistake: the joiner needs to be added to the out_seq, not removed.
I don't expect to encounter standalone joiners; it seems they are attached to the punctuation in this mode, but I handled this case in case someone wants to use another mode.

@funboarder13920 funboarder13920 changed the title [WIP] implement safe bpe dropout for LM + joiner disjoin dropout Safe bpe dropout for LM + joiner disjoin dropout Feb 19, 2021
@funboarder13920 funboarder13920 changed the title Safe bpe dropout for LM + joiner disjoin dropout [WIP] Safe bpe dropout for LM + joiner disjoin dropout Feb 19, 2021