
<PAD> tags should be filtered out from the output of the Tagger #64

Open
oterrier opened this issue Dec 6, 2019 · 17 comments

@oterrier commented Dec 6, 2019

In a sequence labelling scenario, the internal <PAD> tag can be present in the output of the Tagger.tag() method.
As these tags are internal, they should probably be filtered out.

I would be more than happy to provide a fix in a PR if you tell me where it is best to fix it:

In WordPreprocessor.inverse_transform()?
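
As a minimal sketch, the filtering meant here could look something like the snippet below (strip_pad is a hypothetical name, not an existing DeLFT function, and it assumes inverse_transform() yields one list of tag strings per sequence):

PAD_TAG = '<PAD>'

def strip_pad(tag_sequences):
    # hypothetical helper: keep every decoded tag except the internal <PAD> marker
    return [[tag for tag in tags if tag != PAD_TAG] for tags in tag_sequences]

print(strip_pad([['B-<material>', 'I-<material>', '<PAD>']]))
# [['B-<material>', 'I-<material>']]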

@kermitt2 (Owner) commented Dec 8, 2019

Thank you Olivier!
Do you have an example with <PAD> in the output of Tagger.tag()?

In principle we have:

the_tags = list(zip(tokens, tags))

so, if I am not wrong (but I am often wrong), the list will have the length of the shorter of tokens and tags, and there should not be any extra tag not corresponding to a token, hence no <PAD>.
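
As a small illustration of that point (made-up values, not the actual Tagger code path), zip() stops at the shorter of the two sequences, so a longer padded tag list is cut back to the number of real tokens:

tokens = ["Zn", "doped", "sample"]
tags = ["B-<material>", "I-<material>", "O", "<PAD>", "<PAD>"]

the_tags = list(zip(tokens, tags))
print(the_tags)
# [('Zn', 'B-<material>'), ('doped', 'I-<material>'), ('sample', 'O')]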

@oterrier (Author)

Hi
I'm pretty sure we have seen it in our output (I had to patch my client code to take it into account), but I can't find a way to reproduce it right now...

@lfoppiano (Collaborator)

@oterrier do you remember with which model you had this behaviour? Was it a grobid model?

I got this issue while running grobid + delft, when the figure parser was used:

Feb 27 18:07:31 falcon bash[18802]: running thread: 33
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for figure.
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: figure = Sequence('figure')
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: figure.load(dir_path='/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models')
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,685] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for Crucible BSCCO Precursor KCl ΔT Substrate ΔT Page 14 of 19 AUTHOR SUBMITTED MANUSCRIPT -SUST-102528.R1
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,793] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for SUST
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,793] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for R1

@oterrier (Author)

@lfoppiano Unfortunately I don't have an easy scenario to replicate the issue, but I'm pretty sure that it was not with a grobid model.

Best

Olivier

lfoppiano self-assigned this Mar 2, 2020
@lfoppiano (Collaborator)

I've got other cases:

ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for ZN CONCENTRATION RESISTIVITY OF EuBaz
ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for MAGNETIC
ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for PRESSURE AND Zn CONCENTRATION

@kermitt2 (Owner)

Which architecture did you use?
Normally there should not be any <PAD> in the list of labels, so there is another underlying problem to fix if that's the case!

@lfoppiano (Collaborator)

Which architecture did you use?
Normally there should not be any <PAD> in the list of labels, so there is another underlying problem to fix if that's the case!

The quantities model used has the following configuration:

{
    "model_name": "quantities",
    "model_type": "BidLSTM_CRF",
    "embeddings_name": "glove-840B",
    "char_vocab_size": 224,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "num_char_lstm_units": 25,
    "max_char_length": 30,
    "max_sequence_length": null,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "case_embedding_size": 5,
    "dropout": 0.5,
    "recurrent_dropout": 0.5,
    "use_char_feature": true,
    "use_crf": true,
    "fold_number": 1,
    "batch_size": 20,
    "use_ELMo": false,
    "use_BERT": false
}

@lfoppiano (Collaborator)

I'm dissecting the superconductors model, which has the same problem (the earlier log message was a copy-paste mix-up that pointed to the wrong model).

The preprocessor's list of tags includes <PAD>. Is this normal?

vocab_tag = {'<PAD>': 0, 'O': 1, 'B-<tc>': 2, 'I-<tc>': 3, 'B-<material>': 4, 'I-<material>': 5, 'B-<tcValue>': 6, 'I-<tcValue>': 7, 'B-<class>': 8, 'I-<class>': 9, 'B-<me_method>': 10, 'I-<me_method>': 11, 'B-<pressure>': 12, 'I-<pressure>': 13}


I also checked another model, the date model, and <PAD> is within its tag_vocab as well:

{'B-<month>': 2, 'B-<day>': 3, 'O': 1, 'I-<year>': 5, 'I-<day>': 6, '<PAD>': 0, 'B-<year>': 4}

I'm not sure what's correct and what's wrong here...

@kermitt2 (Owner) commented Mar 12, 2020

The preprocessor's list of tags includes <PAD>. Is this normal?

Yes, it is used for padding the label vector. Every "channel" has a <PAD> entry at index 0 in its associated vocab map.

So this is all good so far.
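
A minimal sketch of what that means on the label side (encode_labels and the small vocab are hypothetical, not DeLFT's actual preprocessor code): labels shorter than the maximum sequence length are padded with index 0, which maps to <PAD> in the vocab map.

import numpy as np

vocab_tag = {'<PAD>': 0, 'O': 1, 'B-<material>': 2}

def encode_labels(tags, max_sequence_length):
    # map tags to indices, then pad the label vector with the <PAD> index (0)
    ids = [vocab_tag[t] for t in tags]
    ids += [vocab_tag['<PAD>']] * (max_sequence_length - len(ids))
    return np.array(ids)

print(encode_labels(['B-<material>', 'O'], max_sequence_length=5))
# [2 1 0 0 0]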

@lfoppiano (Collaborator) commented Mar 12, 2020

If <PAD> is used in training, is it normal that it pops out when predicting? Should it just be removed or replaced with <other>?

@kermitt2 (Owner)

<PAD> cannot normally pop out when predicting because everything is cut based on the length of the token sequence, which is what I mentioned above.

If we have a <PAD> in the actual label list, there is something badly aligned in the token/tag list, and this is the actual bug I think, maybe due to some special character? It's a problem because it can also shift some labels to the wrong tokens, so we should not just filter out the <PAD> but try to find the reason for this alignment issue.
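
A small illustration of that failure mode (made-up values): if the token list zipped with the predictions is longer than the number of positions the model actually labelled, the cut no longer removes the padded tail and a <PAD> ends up paired with a real token.

tokens = ["ANISOTROPIC", "λ", "VALUES"]
predicted = ["B-<figDesc>", "I-<figDesc>", "<PAD>", "<PAD>"]  # model only labelled 2 real positions

print(list(zip(tokens, predicted)))
# [('ANISOTROPIC', 'B-<figDesc>'), ('λ', 'I-<figDesc>'), ('VALUES', '<PAD>')]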

@lfoppiano (Collaborator)

👍 . I understand now 😄

@lfoppiano (Collaborator) commented Mar 13, 2020

I'm testing the prediction and the tokens and predictions are aligned. I did not find anything suspicious here.

Here is my test case:
Model: grobid-superconductors.zip

with the following sentences (added to grobidTagger.py):

            elif model == 'superconductors':
                someTexts.append("ANISOTROPIC λ VALUES")
                someTexts.append("ANISOTROPIC A VALUES")

ANISOTROPIC and λ get <PAD> in the output...

Could it be that there is a misalignment in the training?

@lfoppiano (Collaborator)

Another curious behaviour: if you use the figure model and try to tag the string SUST, you get back a list of length 2 with ["<PAD>", "<PAD>"].

        elif model == 'figure':
            someTexts.append("SUST")

While the second <PAD> makes sense (it's the result of padding so as not to leave a sequence with only one element), the first one does not...

@kermitt2 (Owner) commented Apr 1, 2020

While the second <PAD> makes sense (it's the result of padding so as not to leave a sequence with only one element), the first one does not...

If there's a batch with a single sequence of length 1, we extend it to avoid an error from tensorflow 1.* (it might be fixed in tf 2.0). So that would be normal behaviour (this is the purpose of the extend parameter in the preprocessor and embeddings).
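
A small sketch of that with the SUST case (made-up values): the length-1 sequence is grown to length 2 before prediction, the model returns two labels, and zipping against the single real token trims the artificial second position away.

tokens = ["SUST"]
predicted = ["<PAD>", "<PAD>"]   # raw output for the extended, length-2 input

print(list(zip(tokens, predicted)))
# [('SUST', '<PAD>')] -- only the first <PAD> remains, and that one is the actual problem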

@lfoppiano (Collaborator)

Yes, indeed.

Some other questions:

  1. Is it normal that the figure model does not have the <other> or O tag?
{'<PAD>': 0, 'B-<figure_head>': 1, 'B-<label>': 2, 'B-<figDesc>': 3, 'I-<figDesc>': 4, 'I-<figure_head>': 5, 'B-<content>': 6, 'I-<content>': 7}
  2. The result from
                preds = self.model.predict_on_batch(generator_output[0])

I get a batch input (screenshot omitted) where we have the embeddings of the text (300 elements per token), the characters and the sequence length.

But then the output is:

[[[1. 0. 0. 0. 0. 0. 0. 0.],  [1. 0. 0. 0. 0. 0. 0. 0.]]]

8 elements -> 8 labels as before; the 1 at index 0 corresponds to <PAD>.
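
For reference, a rough sketch of how such a one-hot output maps back to tag strings (illustrative decoding only, not DeLFT's exact code): the argmax of each 8-way vector is looked up in the reversed tag vocabulary, and index 0 is <PAD>.

import numpy as np

vocab_tag = {'<PAD>': 0, 'B-<figure_head>': 1, 'B-<label>': 2, 'B-<figDesc>': 3,
             'I-<figDesc>': 4, 'I-<figure_head>': 5, 'B-<content>': 6, 'I-<content>': 7}
indice_tag = {i: t for t, i in vocab_tag.items()}

preds = np.array([[[1., 0., 0., 0., 0., 0., 0., 0.],
                   [1., 0., 0., 0., 0., 0., 0., 0.]]])

tags = [indice_tag[int(i)] for i in preds[0].argmax(axis=-1)]
print(tags)  # ['<PAD>', '<PAD>']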

@lfoppiano (Collaborator) commented Apr 1, 2020

I found out that, for example in the figure model, the following tokens are transformed into a zero array of embeddings when training (probably because the embeddings do not contain such tokens); here are some examples:

Homoplasy
leucokranos
Ω
Ω
hypercementosis
×1
hypercementosis
×15
cEq
hypercementosis
×20
×50
SEC11L3
SEC11L3
PDACs
SEC11L3
100%
surements
EATs
hypercementosis
distobuccal
mesiobuccal
×100
×50
45º
90º
Uninformative
Δe
reion
Ω
Ω
λ=1550
70º
ihDNA
pregenomic
pgRNA
pgRNA
ihDNA
pgRNA
ihDNA
cEq

So here batch_x will contain all-zero arrays. Could this be the problem?
Also, when we pad the sequence, we introduce a zeroed array on the X side and a <PAD> array on the Y side. Maybe we should use two different X vectors, one for the padding and one to represent tokens that have no embeddings?
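
A minimal sketch of that idea (hypothetical, not current DeLFT behaviour; embeddings is assumed to be a plain dict from token to vector): keep the all-zero vector exclusively for padding positions and give out-of-vocabulary tokens their own fixed non-zero vector, so a zero embedding can no longer be confused with a padded position.

import numpy as np

EMBED_DIM = 300
rng = np.random.RandomState(42)

PAD_VECTOR = np.zeros(EMBED_DIM)                  # used only for padded positions
OOV_VECTOR = rng.uniform(-0.25, 0.25, EMBED_DIM)  # shared vector for unknown tokens

def lookup(token, embeddings):
    # fall back to the dedicated OOV vector instead of zeros for unknown tokens
    return embeddings.get(token, OOV_VECTOR)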

I added these checks at line 139 of data_generator.py and ran the figure model training:

        # padded positions (beyond the tokenized length) must carry a zero
        # embedding vector and the one-hot <PAD> label (index 0)
        for b in range(0, len(batch_x)):
            for i in range(len(x_tokenized[b]), len(batch_x[b])):
                assert np.sum(batch_x[b][i]) == 0.0
                assert list(batch_y[b][i]) == [1, 0, 0, 0, 0, 0, 0, 0]

        # every position labelled <PAD> must also have a zero embedding
        for b in range(0, len(batch_y)):
            for i in range(0, len(batch_y[b])):
                if list(batch_y[b][i]) == [1, 0, 0, 0, 0, 0, 0, 0]:
                    assert np.sum(batch_x[b][i]) == 0.0

        # but the reverse does not hold: a zero embedding with a non-<PAD>
        # label reveals a token missing from the embeddings
        for b in range(0, len(batch_x)):
            for i in range(0, len(batch_x[b])):
                if np.sum(batch_x[b][i]) == 0.0:
                    if list(batch_y[b][i]) != [1, 0, 0, 0, 0, 0, 0, 0]:
                        print(sub_x[b][i])
