
<PAD> tags should be filtered out from the output of the Tagger #64

Open
oterrier opened this issue Dec 6, 2019 · 17 comments

@oterrier commented Dec 6, 2019

In a sequence labelling scenario, the internal <PAD> tag can be present in the output of the Tagger.tag() method.
As these tags are internal, they should probably be filtered out.

I would be more than happy to provide a fix in a PR if you tell me where it is best to fix it:

In WordPreprocessor.inverse_transform()?
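
As a minimal sketch, the filtering meant here could look something like the snippet below (strip_pad is a hypothetical name, not an existing DeLFT function, and it assumes inverse_transform() yields one list of tag strings per sequence):

PAD_TAG = '<PAD>'

def strip_pad(tag_sequences):
    # hypothetical helper: keep every decoded tag except the internal <PAD> marker
    return [[tag for tag in tags if tag != PAD_TAG] for tags in tag_sequences]

print(strip_pad([['B-<material>', 'I-<material>', '<PAD>']]))
# [['B-<material>', 'I-<material>']]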

@kermitt2 (Owner) commented Dec 8, 2019

Thank you Olivier!
Do you have an example with <PAD> in the output of Tagger.tag()?

In principle we have:

the_tags = list(zip(tokens, tags))

so, if I am not wrong (but I am often wrong), the list will have the length of the shorter of tokens and tags, and there should not be any extra tag not corresponding to a token, hence no <PAD>.
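
As a small illustration of that point (made-up values, not the actual Tagger code path), zip() stops at the shorter of the two sequences, so a longer padded tag list is cut back to the number of real tokens:

tokens = ["Zn", "doped", "sample"]
tags = ["B-<material>", "I-<material>", "O", "<PAD>", "<PAD>"]

the_tags = list(zip(tokens, tags))
print(the_tags)
# [('Zn', 'B-<material>'), ('doped', 'I-<material>'), ('sample', 'O')]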

@oterrier (Author)

Hi
I'm pretty sure we have seen it in our output (I had to patch my client code to take it into account), but I can't find a way to reproduce it right now...

@lfoppiano (Collaborator)

@oterrier do you remember with which model you had this behaviour? Was it a grobid model?

I got this issue while running grobid + delft, when the figure parser was used:

Feb 27 18:07:31 falcon bash[18802]: running thread: 33
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: Loading DeLFT model for figure.
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: figure = Sequence('figure')
Feb 27 18:07:31 falcon bash[18802]: INFO  [2020-02-27 09:07:31,744] org.grobid.core.jni.DeLFTModel: figure.load(dir_path='/data/workspace/services/grobidl/grobid-superconductors/../grobid-home/models')
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,685] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for Crucible BSCCO Precursor KCl ΔT Substrate ΔT Page 14 of 19 AUTHOR SUBMITTED MANUSCRIPT -SUST-102528.R1
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,793] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for SUST
Feb 27 18:07:33 falcon bash[18802]: WARN  [2020-02-27 09:07:33,793] org.grobid.core.engines.FigureParser: Unexpected figure model label - <PAD> for R1

@oterrier (Author)

@lfoppiano Unfortunately I don't have an easy scenario to replicate the issue, but I'm pretty sure that it was not with a grobid model.

Best

Olivier

lfoppiano self-assigned this Mar 2, 2020
@lfoppiano (Collaborator)

I've got other cases:

ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for ZN CONCENTRATION RESISTIVITY OF EuBaz
ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for MAGNETIC
ERROR [2020-03-12 01:33:35,044] org.grobid.core.engines.SuperconductorsParser: Warning: unexpected label in quantity parser: <PAD> for PRESSURE AND Zn CONCENTRATION

@kermitt2 (Owner)

Which architecture did you use?
Normally there should not be any <PAD> in the list of labels, so there is another underlying problem to fix if that's the case!

@lfoppiano (Collaborator)

Which architecture did you use?
Normally there should not be any <PAD> in the list of labels, so there is another underlying problem to fix if that's the case!

The quantities model used has the following configuration:

{
    "model_name": "quantities",
    "model_type": "BidLSTM_CRF",
    "embeddings_name": "glove-840B",
    "char_vocab_size": 224,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "num_char_lstm_units": 25,
    "max_char_length": 30,
    "max_sequence_length": null,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "case_embedding_size": 5,
    "dropout": 0.5,
    "recurrent_dropout": 0.5,
    "use_char_feature": true,
    "use_crf": true,
    "fold_number": 1,
    "batch_size": 20,
    "use_ELMo": false,
    "use_BERT": false
}

@lfoppiano (Collaborator)

I'm dissecting the superconductors model, which has the same problem (the earlier log message was a copy-paste mix-up that pointed to the wrong model).

The preprocessor's list of tags includes <PAD>. Is this normal?

vocab_tag = {'<PAD>': 0, 'O': 1, 'B-<tc>': 2, 'I-<tc>': 3, 'B-<material>': 4, 'I-<material>': 5, 'B-<tcValue>': 6, 'I-<tcValue>': 7, 'B-<class>': 8, 'I-<class>': 9, 'B-<me_method>': 10, 'I-<me_method>': 11, 'B-<pressure>': 12, 'I-<pressure>': 13}


I also checked another model, the date model, and <PAD> is within its tag_vocab as well:

{'B-<month>': 2, 'B-<day>': 3, 'O': 1, 'I-<year>': 5, 'I-<day>': 6, '<PAD>': 0, 'B-<year>': 4}

I'm not sure what's correct and what's wrong here...

@kermitt2 (Owner) commented Mar 12, 2020

The preprocessor's list of tags includes <PAD>. Is this normal?

Yes, it is used for padding the label vector. Every "channel" has a <PAD> entry at index 0 in its associated vocab map.

So this is all good so far.
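
A minimal sketch of what that means on the label side (encode_labels and the small vocab are hypothetical, not DeLFT's actual preprocessor code): labels shorter than the maximum sequence length are padded with index 0, which maps to <PAD> in the vocab map.

import numpy as np

vocab_tag = {'<PAD>': 0, 'O': 1, 'B-<material>': 2}

def encode_labels(tags, max_sequence_length):
    # map tags to indices, then pad the label vector with the <PAD> index (0)
    ids = [vocab_tag[t] for t in tags]
    ids += [vocab_tag['<PAD>']] * (max_sequence_length - len(ids))
    return np.array(ids)

print(encode_labels(['B-<material>', 'O'], max_sequence_length=5))
# [2 1 0 0 0]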

@lfoppiano (Collaborator) commented Mar 12, 2020

If <PAD> is used in training, is it normal that it pops out when predicting? Should it just be removed or replaced with <other>?

@kermitt2 (Owner)

<PAD> cannot normally pop out when predicting because everything is cut based on the length of the token sequence, which is what I mentioned above.

If we have a <PAD> in the actual label list, there is something badly aligned in the token/tag list, and this is the actual bug I think, maybe due to some special character? It's a problem because it can also shift some labels to the wrong tokens, so we should not just filter out the <PAD> but try to find the reason for this alignment issue.
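
A small illustration of that failure mode (made-up values): if the token list zipped with the predictions is longer than the number of positions the model actually labelled, the cut no longer removes the padded tail and a <PAD> ends up paired with a real token.

tokens = ["ANISOTROPIC", "λ", "VALUES"]
predicted = ["B-<figDesc>", "I-<figDesc>", "<PAD>", "<PAD>"]  # model only labelled 2 real positions

print(list(zip(tokens, predicted)))
# [('ANISOTROPIC', 'B-<figDesc>'), ('λ', 'I-<figDesc>'), ('VALUES', '<PAD>')]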

@lfoppiano (Collaborator)

👍 . I understand now 😄

@lfoppiano (Collaborator) commented Mar 13, 2020

I'm testing the prediction and the tokens and predictions are aligned. I did not find anything suspicious here.

Here is my test case:
Model: grobid-superconductors.zip

with the following sentences (added to grobidTagger.py):

            elif model == 'superconductors':
                someTexts.append("ANISOTROPIC λ VALUES")
                someTexts.append("ANISOTROPIC A VALUES")

ANISOTROPIC and λ get <PAD> in the output...

Could it be that there is a misalignment in the training?

@lfoppiano (Collaborator)

Another curious behaviour: if you use the figure model and try to tag the string SUST, you get back a list of length 2 with ["<PAD>", "<PAD>"].

        elif model == 'figure':
            someTexts.append("SUST")

While the second <PAD> makes sense (it's the result of padding so as not to leave a sequence with only one element), the first one does not...

@kermitt2 (Owner) commented Apr 1, 2020

While the second <PAD> makes sense (it's the result of padding so as not to leave a sequence with only one element), the first one does not...

If there's a batch with a single sequence of length 1, we extend it to avoid an error from tensorflow 1.* (it might be fixed in tf 2.0). So that would be normal behaviour (this is the purpose of the extend parameter in the preprocessor and embeddings).
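
A small sketch of that with the SUST case (made-up values): the length-1 sequence is grown to length 2 before prediction, the model returns two labels, and zipping against the single real token trims the artificial second position away.

tokens = ["SUST"]
predicted = ["<PAD>", "<PAD>"]   # raw output for the extended, length-2 input

print(list(zip(tokens, predicted)))
# [('SUST', '<PAD>')] -- only the first <PAD> remains, and that one is the actual problem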

@lfoppiano (Collaborator)

Yes, indeed.

Some other questions:

  1. Is it normal that the figure model does not have the <other> or O tag?
{'<PAD>': 0, 'B-<figure_head>': 1, 'B-<label>': 2, 'B-<figDesc>': 3, 'I-<figDesc>': 4, 'I-<figure_head>': 5, 'B-<content>': 6, 'I-<content>': 7}
  2. The result from
                preds = self.model.predict_on_batch(generator_output[0])

I get a batch input (screenshot omitted) where we have the embeddings of the text (300 elements per token), the characters and the sequence length.

But then the output is:

[[[1. 0. 0. 0. 0. 0. 0. 0.],  [1. 0. 0. 0. 0. 0. 0. 0.]]]

8 elements -> 8 labels as before; the 1 at index 0 corresponds to <PAD>.
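
For reference, a rough sketch of how such a one-hot output maps back to tag strings (illustrative decoding only, not DeLFT's exact code): the argmax of each 8-way vector is looked up in the reversed tag vocabulary, and index 0 is <PAD>.

import numpy as np

vocab_tag = {'<PAD>': 0, 'B-<figure_head>': 1, 'B-<label>': 2, 'B-<figDesc>': 3,
             'I-<figDesc>': 4, 'I-<figure_head>': 5, 'B-<content>': 6, 'I-<content>': 7}
indice_tag = {i: t for t, i in vocab_tag.items()}

preds = np.array([[[1., 0., 0., 0., 0., 0., 0., 0.],
                   [1., 0., 0., 0., 0., 0., 0., 0.]]])

tags = [indice_tag[int(i)] for i in preds[0].argmax(axis=-1)]
print(tags)  # ['<PAD>', '<PAD>']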

@lfoppiano (Collaborator) commented Apr 1, 2020

I found out that, for example in the figure model, the following tokens are transformed into a zero array of embeddings when training (probably because the embeddings do not contain such tokens); here are some examples:

Homoplasy
leucokranos
Ω
Ω
hypercementosis
×1
hypercementosis
×15
cEq
hypercementosis
×20
×50
SEC11L3
SEC11L3
PDACs
SEC11L3
100%
surements
EATs
hypercementosis
distobuccal
mesiobuccal
×100
×50
45º
90º
Uninformative
Δe
reion
Ω
Ω
λ=1550
70º
ihDNA
pregenomic
pgRNA
pgRNA
ihDNA
pgRNA
ihDNA
cEq

So here batch_x will contain all-zero arrays. Could this be the problem?
Also, when we pad the sequence, we introduce a zeroed array on the X side and a <PAD> array on the Y side. Maybe we should use two different X vectors, one for the padding and one to represent tokens that have no embeddings?
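
A minimal sketch of that idea (hypothetical, not current DeLFT behaviour; embeddings is assumed to be a plain dict from token to vector): keep the all-zero vector exclusively for padding positions and give out-of-vocabulary tokens their own fixed non-zero vector, so a zero embedding can no longer be confused with a padded position.

import numpy as np

EMBED_DIM = 300
rng = np.random.RandomState(42)

PAD_VECTOR = np.zeros(EMBED_DIM)                  # used only for padded positions
OOV_VECTOR = rng.uniform(-0.25, 0.25, EMBED_DIM)  # shared vector for unknown tokens

def lookup(token, embeddings):
    # fall back to the dedicated OOV vector instead of zeros for unknown tokens
    return embeddings.get(token, OOV_VECTOR)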

I added these checks at line 139 of data_generator.py and ran the figure model training:

        # padded positions (beyond the tokenized length) must carry a zero
        # embedding vector and the one-hot <PAD> label (index 0)
        for b in range(0, len(batch_x)):
            for i in range(len(x_tokenized[b]), len(batch_x[b])):
                assert np.sum(batch_x[b][i]) == 0.0
                assert list(batch_y[b][i]) == [1, 0, 0, 0, 0, 0, 0, 0]

        # every position labelled <PAD> must also have a zero embedding
        for b in range(0, len(batch_y)):
            for i in range(0, len(batch_y[b])):
                if list(batch_y[b][i]) == [1, 0, 0, 0, 0, 0, 0, 0]:
                    assert np.sum(batch_x[b][i]) == 0.0

        # but the reverse does not hold: a zero embedding with a non-<PAD>
        # label reveals a token missing from the embeddings
        for b in range(0, len(batch_x)):
            for i in range(0, len(batch_x[b])):
                if np.sum(batch_x[b][i]) == 0.0:
                    if list(batch_y[b][i]) != [1, 0, 0, 0, 0, 0, 0, 0]:
                        print(sub_x[b][i])
