
[Script] Valid sequence length used in Electra dynamic masking #1321

Closed
liuzh47 opened this issue Aug 27, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@liuzh47
Contributor

liuzh47 commented Aug 27, 2020

Description

valid_candidates is used to mark the non-reserved tokens in a sequence in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L503. For example, for a sequence like

[CLS] Manhattan is the core of New York City.[SEP][PAD][PAD][PAD]

The corresponding valid_candidates mask should be:

01111111110000

In short, valid_candidates masks out tokens like [CLS], [SEP], and [PAD]. The current implementation of valid_candidates is wrong: it always outputs sequences of all 1s.

The problem is that the initialization of valid_candidates is wrong, as shown in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L497

 valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)

valid_candidates is initialized to all 1s, so the subsequent additions (an elementwise OR on boolean arrays) can never change its values.
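
For illustration, here is a minimal plain-NumPy sketch of the failure mode (the actual script uses F.np; the token ids below are made up for the example). Adding a boolean not_equal result to an all-True boolean array acts as an elementwise OR, so the mask stays all 1s no matter what:

import numpy as np

input_ids = np.array([101, 7128, 2003, 102, 0, 0])  # hypothetical ids for [CLS] ... [SEP] [PAD] [PAD]
ignore_tokens = [101, 102, 0]                        # hypothetical cls/sep/pad ids

valid_candidates = np.ones_like(input_ids, dtype=bool)
for ignore_token in ignore_tokens:
    # '+' on boolean arrays is an elementwise OR, so True + anything stays True
    valid_candidates = valid_candidates + np.not_equal(input_ids, ignore_token)

print(valid_candidates)  # [ True  True  True  True  True  True] -- every position marked valid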

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
@liuzh47 liuzh47 added the bug Something isn't working label Aug 27, 2020
@zheyuye
Member

zheyuye commented Aug 27, 2020

This issue identifies a fatal problem that renders valid_candidates ineffective.

@sxjscience
Member

For a quick fix, we may change this section

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
ignore_tokens = [self.vocab.cls_id, self.vocab.sep_id, self.vocab.pad_id]
for ignore_token in ignore_tokens:
    # TODO(zheyuye), Update when operation += supported
    valid_candidates = valid_candidates + \
        F.np.not_equal(input_ids, ignore_token)

We can change that to

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool) 
for ignore_token in ignore_tokens: 
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.

@liuzh47
Contributor Author

liuzh47 commented Aug 27, 2020

For a quick fix, we may change this section

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
ignore_tokens = [self.vocab.cls_id, self.vocab.sep_id, self.vocab.pad_id]
for ignore_token in ignore_tokens:
    # TODO(zheyuye), Update when operation += supported
    valid_candidates = valid_candidates + \
        F.np.not_equal(input_ids, ignore_token)

We can change that to

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool) 
for ignore_token in ignore_tokens: 
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.

You cannot use minus here; some of the values may end up negative after that. Using multiply instead will solve the problem.
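
For reference, a minimal plain-NumPy sketch of the multiplication-based fix, reusing the same hypothetical ids as above. Multiplying boolean arrays is an elementwise AND, so a position stays valid only if it differs from every ignored token:

import numpy as np

input_ids = np.array([101, 7128, 2003, 102, 0, 0])  # hypothetical ids for [CLS] ... [SEP] [PAD] [PAD]
ignore_tokens = [101, 102, 0]                        # hypothetical cls/sep/pad ids

valid_candidates = np.ones_like(input_ids, dtype=bool)
for ignore_token in ignore_tokens:
    # '*' on boolean arrays is an elementwise AND, so ignored positions are switched off
    valid_candidates = valid_candidates * np.not_equal(input_ids, ignore_token)

print(valid_candidates)  # [False  True  True False False False] -- special and padding tokens masked out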
