
[Script] Valid sequence length used in Electra dynamic masking #1321

Closed
liuzh47 opened this issue Aug 27, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@liuzh47
Contributor

liuzh47 commented Aug 27, 2020

Description

valid_candidates is used to mark the non-reserved tokens in a sequence in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L503. For example, for a sequence like

[CLS] Manhattan is the core of New York City.[SEP][PAD][PAD][PAD]

The corresponding valid_candidates mask should be:

01111111110000

In short, valid_candidates masks out tokens like [CLS], [SEP], and [PAD]. The current implementation of valid_candidates is wrong: it always outputs sequences of all 1s.

The problem is that the initialization of valid_candidates is wrong, as shown in https://github.com/dmlc/gluon-nlp/blob/master/scripts/pretraining/pretraining_utils.py#L497

 valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)

valid_candidates is initialized to all 1s, so the subsequent additions (an elementwise OR on boolean arrays) can never change its values.
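
For illustration, here is a minimal plain-NumPy sketch of the failure mode (the actual script uses F.np; the token ids below are made up for the example). Adding a boolean not_equal result to an all-True boolean array acts as an elementwise OR, so the mask stays all 1s no matter what:

import numpy as np

input_ids = np.array([101, 7128, 2003, 102, 0, 0])  # hypothetical ids for [CLS] ... [SEP] [PAD] [PAD]
ignore_tokens = [101, 102, 0]                        # hypothetical cls/sep/pad ids

valid_candidates = np.ones_like(input_ids, dtype=bool)
for ignore_token in ignore_tokens:
    # '+' on boolean arrays is an elementwise OR, so True + anything stays True
    valid_candidates = valid_candidates + np.not_equal(input_ids, ignore_token)

print(valid_candidates)  # [ True  True  True  True  True  True] -- every position marked valid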

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
@liuzh47 liuzh47 added the bug Something isn't working label Aug 27, 2020
@zheyuye
Member

zheyuye commented Aug 27, 2020

This issue identifies a fatal problem that renders valid_candidates ineffective.

@sxjscience
Member

For a quick fix, we may change this section

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
ignore_tokens = [self.vocab.cls_id, self.vocab.sep_id, self.vocab.pad_id]
for ignore_token in ignore_tokens:
    # TODO(zheyuye), Update when operation += supported
    valid_candidates = valid_candidates + \
        F.np.not_equal(input_ids, ignore_token)

We can change that to

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool) 
for ignore_token in ignore_tokens: 
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.

@liuzh47
Contributor Author

liuzh47 commented Aug 27, 2020

For a quick fix, we may change this section

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool)
ignore_tokens = [self.vocab.cls_id, self.vocab.sep_id, self.vocab.pad_id]
for ignore_token in ignore_tokens:
    # TODO(zheyuye), Update when operation += supported
    valid_candidates = valid_candidates + \
        F.np.not_equal(input_ids, ignore_token)

We can change that to

valid_candidates = F.np.ones_like(input_ids, dtype=np.bool) 
for ignore_token in ignore_tokens: 
    valid_candidates = valid_candidates - F.np.equal(input_ids, ignore_token)

In addition, I think it will be better to move it to the preprocessing phase.

You cannot use minus here; some of the values may end up negative after that. Using multiply instead will solve the problem.
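
For reference, a minimal plain-NumPy sketch of the multiplication-based fix, reusing the same hypothetical ids as above. Multiplying boolean arrays is an elementwise AND, so a position stays valid only if it differs from every ignored token:

import numpy as np

input_ids = np.array([101, 7128, 2003, 102, 0, 0])  # hypothetical ids for [CLS] ... [SEP] [PAD] [PAD]
ignore_tokens = [101, 102, 0]                        # hypothetical cls/sep/pad ids

valid_candidates = np.ones_like(input_ids, dtype=bool)
for ignore_token in ignore_tokens:
    # '*' on boolean arrays is an elementwise AND, so ignored positions are switched off
    valid_candidates = valid_candidates * np.not_equal(input_ids, ignore_token)

print(valid_candidates)  # [False  True  True False False False] -- special and padding tokens masked out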
