This repository has been archived by the owner on Jan 15, 2024. It is now read-only.
[Script] Valid sequence length used in Electra dynamic masking #1321
Labels: bug (Something isn't working)
Comments
This issue finds a fatal problem that makes
For a quick fix, we may change this section: gluon-nlp/scripts/pretraining/pretraining_utils.py, lines 497 to 503 in 970318d.

We can change that to

    valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
    for ignore_token in ignore_tokens:
        valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.
You cannot use minus here; some of the values may end up being negative after that. Using multiplication instead will solve the problem.
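To make the difference between the two operators concrete, here is a minimal sketch in plain NumPy rather than the script's F.np API. The token ids (101, 102, 0), the helper names, and the duplicated entry in the ignore list are all assumptions made up for illustration. A duplicated id is only one way the subtraction could go negative, and it may not be the case the comment above had in mind.

```python
import numpy as np


def valid_candidates_by_multiply(input_ids, ignore_tokens):
    """Return 1 where a token may be masked and 0 where it is reserved."""
    valid = np.ones_like(input_ids, dtype=np.int32)
    for ignore_token in ignore_tokens:
        # Multiplying by a 0/1 indicator can only turn a 1 into a 0,
        # so the result always stays in {0, 1}.
        valid = valid * np.not_equal(input_ids, ignore_token)
    return valid


def valid_candidates_by_subtract(input_ids, ignore_tokens):
    """Same idea with subtraction; a position can be decremented twice."""
    valid = np.ones_like(input_ids, dtype=np.int32)
    for ignore_token in ignore_tokens:
        valid = valid - np.equal(input_ids, ignore_token)
    return valid


# Hypothetical ids: 101 = [CLS], 102 = [SEP], 0 = [PAD].
input_ids = np.array([[101, 7, 8, 102, 0, 0]])
# Assumed edge case: the pad id appears twice in the ignore list.
duplicated_ignore_tokens = [101, 102, 0, 0]

print(valid_candidates_by_multiply(input_ids, duplicated_ignore_tokens))
print(valid_candidates_by_subtract(input_ids, duplicated_ignore_tokens))
```

In this toy case the multiplicative version yields a clean 0/1 mask, while the subtractive version leaves -1 at the padded positions.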
Description
valid_candidates is used to mark the non-reserved tokens in the sequence in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L503. For example, for a sequence like

The corresponding valid_candidates tokens should be like:

In short, valid_candidates masks out tokens like [CLS], [SEP], and [PAD]. The current implementation of valid_candidates is wrong: it always outputs sequences of all 1s.

The problem is that the initialization of valid_candidates is wrong, as in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L497. valid_candidates is initialized to all 1s, and the subsequent operations never change its value.

Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
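As an illustration of the behavior the description expects from valid_candidates, the sketch below builds the mask for a toy sequence in plain NumPy and contrasts it with an all-ones result. The token strings, the ids, and the OR-style update in the "buggy" loop are assumptions for illustration only. The loop is one way an all-ones initialization could never change, not a copy of the script's code.

```python
import numpy as np

# Hypothetical vocabulary ids, chosen only for this illustration.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

tokens = ["[CLS]", "hello", "world", "[SEP]", "[PAD]", "[PAD]"]
input_ids = np.array([CLS_ID, 2001, 2002, SEP_ID, PAD_ID, PAD_ID])

# Expected mask: reserved tokens are False (0), everything else is True (1).
expected = ~np.isin(input_ids, [CLS_ID, SEP_ID, PAD_ID])

# Assumed failure mode: start from all True and combine with an OR-like
# operation. True OR anything is True, so the mask can never change.
buggy = np.ones_like(input_ids, dtype=bool)
for ignore_token in (CLS_ID, SEP_ID, PAD_ID):
    buggy = buggy | np.not_equal(input_ids, ignore_token)

for token, want, got in zip(tokens, expected.astype(int), buggy.astype(int)):
    print(f"{token:>6}  expected={want}  buggy={got}")
```

With the expected mask, only "hello" and "world" are candidates for masking. With the all-ones mask, [CLS], [SEP], and [PAD] would be eligible as well.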