[WIP] Safe bpe dropout for LM + joiner disjoin dropout #2009
base: master
Conversation
funboarder13920 commented Feb 17, 2021 (edited)
- On LM tasks: copy src to target in the tokenizer transforms; otherwise results are hazardous when tokenizers have random behaviors such as dropout (a rough sketch follows this list)
- Remove dropout in tokenizers during validation mode
- Implement a "disjoin joiner with dropout" transform to make inference possible at any point in the sentence
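A rough sketch of the first two points, in the spirit of OpenNMT-py's transform API; the class, its methods, and the toy tokenizer below are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch only: names and structure are assumptions, not this PR's code.
import random


class DummySubwordTransform:
    """Toy transform showing the two safety points above."""

    def __init__(self, bpe_dropout, is_lm_task):
        self.bpe_dropout = bpe_dropout
        self.is_lm_task = is_lm_task

    def apply(self, example, is_train=False):
        # 1) No stochastic segmentation outside of training (validation/inference).
        dropout = self.bpe_dropout if is_train else 0.0
        src_out = self._tokenize(example["src"], dropout)
        example["src"] = src_out
        if self.is_lm_task:
            # 2) For LM tasks, copy the already-tokenized src to tgt instead of
            # re-tokenizing it: a second stochastic pass would yield a different,
            # misaligned segmentation.
            example["tgt"] = list(src_out)
        elif example.get("tgt") is not None:
            example["tgt"] = self._tokenize(example["tgt"], dropout)
        return example

    def _tokenize(self, tokens, dropout):
        # Stand-in for the real subword tokenizer; randomly splits tokens in half
        # with probability `dropout` just to make the stochastic behavior visible.
        out = []
        for tok in tokens:
            if len(tok) > 1 and random.random() < dropout:
                mid = len(tok) // 2
                out.extend([tok[:mid], tok[mid:]])
            else:
                out.append(tok)
        return out
```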
onmt/transforms/misc.py
Outdated
else:
    src_out = self.dropout_separate_joiner(example["src"], "src")
    example["src"] = src_out
if self.opts.model_task == ModelTask.LANGUAGE_MODEL:
Currently this won't work on build_vocab: model_task is a model parameter, not a dynamic corpus parameter.
Same issue in tokenizers.py.
is_train should be false anyway in build_vocab, no?
EDIT: Nevermind, I mixed up with build_vocab_only.
if elem == SubwordMarker.JOINER:
    continue
if elem.startswith(SubwordMarker.JOINER):
    if random.random() < dropout:
Not sure about the necessity of handling both right and left token sides.
It might make it difficult to retrieve the initial token when detokenizing.
I can use a special token to distinguish left and right, or remove the left-side disjoin, which doesn't occur much (mainly in punctuation).
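For context, a minimal sketch of what a left-side joiner "disjoin" with dropout could look like; the joiner symbol and the helper name are assumptions, and right-side handling would be symmetric:

```python
import random

JOINER = "￭"  # OpenNMT-style joiner marker (assumed symbol)


def disjoin_joiner_with_dropout(tokens, dropout):
    """With probability `dropout`, detach a leading joiner from its token so it
    becomes a standalone token (the right-side case would mirror this)."""
    out_seq = []
    for elem in tokens:
        if elem == JOINER:
            # Already standalone: keep it, since dropping it would lose the
            # information needed to rebuild the sentence at detokenization time.
            out_seq.append(elem)
        elif elem.startswith(JOINER) and random.random() < dropout:
            # Split "￭token" into "￭" + "token".
            out_seq.append(JOINER)
            out_seq.append(elem[len(JOINER):])
        else:
            out_seq.append(elem)
    return out_seq


# Example: ["Hello", "￭,", "how"] may become ["Hello", "￭", ",", "how"]
```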
onmt/transforms/tokenize.py
Outdated
    kwopts['bpe_dropout'] = subword_alpha if is_train else 0
elif subword_type == 'sentencepiece':
    kwopts['sp_model_path'] = subword_model
    kwopts['sp_nbest_size'] = subword_nbest
You can directly reassign subword_alpha & subword_nbest to disable both bpe_dropout / sp sampling.
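The suggestion could look roughly like this (variable names follow the quoted snippet; treat this as a sketch rather than the final diff):

```python
def build_tokenizer_kwargs(subword_type, subword_model, subword_alpha,
                           subword_nbest, is_train):
    """Sketch: reassign subword_alpha/subword_nbest once so that BPE dropout
    and SentencePiece sampling are both disabled outside of training."""
    if not is_train:
        subword_alpha = 0   # no BPE dropout / SP sampling
        subword_nbest = 1   # no SP n-best sampling
    kwopts = {}
    if subword_type == 'bpe':
        kwopts['bpe_dropout'] = subword_alpha
    elif subword_type == 'sentencepiece':
        kwopts['sp_model_path'] = subword_model
        kwopts['sp_nbest_size'] = subword_nbest
    return kwopts
```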
_diff_vocab = (
    src_subword_kwargs.get('vocabulary_path', '') !=
    tgt_subword_kwargs.get('vocabulary_path', '') or
    src_subword_kwargs.get('vocabulary_threshold', 0) !=
    tgt_subword_kwargs.get('vocabulary_threshold', 0))
I personally prefer the current one, which seems more readable.
Could you elaborate a bit on the interest of this joiner disjoin dropout mechanism?
If we drop an independent joiner, we may have trouble recovering the original sentence when detokenizing some sequences. Also, randomly splitting left/right joiners from tokens will increase sequence length and cause more tokens to fall into <unk>.
Is there any good reason to apply this despite all those potential limitations/conflicts? What's the objective of this proposal?
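To make the concern concrete, a toy illustration (the tokens and vocabulary are invented for the example):

```python
# Toy illustration of the length / <unk> concern.
vocab = {"Hel", "￭lo", "world", "￭"}

joined = ["Hel", "￭lo", "world"]         # 3 tokens, all in the vocabulary
disjoined = ["Hel", "￭", "lo", "world"]  # 4 tokens after a random disjoin

unk = [tok for tok in disjoined if tok not in vocab]
print(len(joined), len(disjoined), unk)  # 3 4 ['lo'] -> longer sequence, new <unk>
```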
The idea is to allow LM generation from incomplete words, without explicitly knowing that the word is incomplete.
The goal is to make inference possible at any point in the sentence, even in the middle of a word, without having to handle that at translation/generation time. One issue might be that
Yes, the joiner mark will join words whether or not it is attached to a token. Actually, I'm talking about lines 85-86 in the method
It is a mistake; it needs to be added to the out_seq, not removed.