<PAD> tags should be filtered out from the output of the Tagger #64
Comments
Thank you Olivier! In principle we have:
so, if I am not wrong (but I am often wrong), the list will have the size of the shorter of tokens and tags, and there should not be an extra tag not corresponding to a token, so no <PAD> in the output.
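For illustration, here is a minimal sketch of the alignment argument above (toy data and hypothetical code, not delft's actual implementation): pairing tokens with tags stops at the shorter list, so a trailing padding label would simply be dropped.

```python
# Toy illustration of the alignment argument above (not delft's actual code).
tokens = ["The", "figure", "shows", "results"]
tags = ["O", "B-figure", "O", "O", "<PAD>"]  # one spurious trailing padding label

# zip() stops at the shorter of the two lists, so the extra <PAD> disappears.
aligned = list(zip(tokens, tags))
print(aligned)
# [('The', 'O'), ('figure', 'B-figure'), ('shows', 'O'), ('results', 'O')]
```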
Hi
@oterrier do you remember with which model you saw this behaviour? Was it a grobid model? I got this issue while running grobid + delft, when the figure parser was used:
@lfoppiano Unfortunately I don't have an easy scenario to replicate the issue, but I'm pretty sure that it was not with a grobid model. Best, Olivier
I've got other cases:
Which architecture did you use?
The configuration of the quantity parser used is the following:

```json
{
  "model_name": "quantities",
  "model_type": "BidLSTM_CRF",
  "embeddings_name": "glove-840B",
  "char_vocab_size": 224,
  "case_vocab_size": 8,
  "char_embedding_size": 25,
  "num_char_lstm_units": 25,
  "max_char_length": 30,
  "max_sequence_length": null,
  "word_embedding_size": 300,
  "num_word_lstm_units": 100,
  "case_embedding_size": 5,
  "dropout": 0.5,
  "recurrent_dropout": 0.5,
  "use_char_feature": true,
  "use_crf": true,
  "fold_number": 1,
  "batch_size": 20,
  "use_ELMo": false,
  "use_BERT": false
}
```
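As a side note, a small sketch of how such a configuration could be inspected, assuming it is stored as a config.json file next to the model weights (the filename and location are assumptions):

```python
import json

# Assumed path; adjust to wherever the quantities model config actually lives.
with open("config.json") as f:
    config = json.load(f)

print(config["model_type"])           # BidLSTM_CRF
print(config["max_sequence_length"])  # None, i.e. no fixed sequence length in the config
print(config["use_crf"])              # True
```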
Yes, it is for padding the label vector. Every "channel" will have a <PAD>. So this is all good so far.
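To make the padding remark concrete, here is a minimal sketch (hypothetical tag vocabulary; the only point taken from this thread is that a <PAD> entry is used to pad the label vector):

```python
# Hypothetical sketch of padding label sequences with a <PAD> entry at index 0.
tag_vocab = {"<PAD>": 0, "O": 1, "B-figure": 2, "I-figure": 3}

def pad_labels(tag_sequences, max_len, pad_id=tag_vocab["<PAD>"]):
    padded = []
    for tags in tag_sequences:
        ids = [tag_vocab[t] for t in tags]
        padded.append(ids + [pad_id] * (max_len - len(ids)))
    return padded

print(pad_labels([["O", "B-figure"], ["O"]], max_len=3))
# [[1, 2, 0], [1, 0, 0]] -> the zeros are the <PAD> labels discussed here
```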
If the …
If we have a …
👍 I understand now 😄
I'm testing the prediction, and the tokens and predictions are aligned. I did not find anything suspicious here. Here is my test case: with the following sentence (add it in …), ANISOTROPIC and λ get <PAD>. Might it be that there is a misalignment in the training?
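A small sketch of the kind of alignment check mentioned above (hypothetical helper, not part of delft):

```python
# Hypothetical helper: every predicted tag sequence should have exactly one
# tag per token; any length mismatch would point to a misalignment.
def check_alignment(token_sequences, tag_sequences):
    return [
        (i, len(toks), len(tags))
        for i, (toks, tags) in enumerate(zip(token_sequences, tag_sequences))
        if len(toks) != len(tags)
    ]

print(check_alignment([["a", "b"], ["c"]], [["O", "O"], ["O"]]))  # [] -> aligned
```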
Another curious behaviour: if you use the figure model and try to tag the string …
While the second …
If there's a batch with a single sequence of length 1, we extend it to avoid an error from tensorflow 1.* (it might be fixed in tf 2.0). So that would be a normal behaviour (this is the purpose of the …).
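A sketch of the workaround being described, under the assumption that the extension simply appends one padded position to the lone length-1 sequence (hypothetical code, not delft's actual implementation):

```python
import numpy as np

# Hypothetical sketch: a batch containing a single sequence of length 1 is
# extended by one padded time step so TF 1.x accepts the tensor shapes.
def extend_lonely_sequence(batch_x, pad_value=0.0):
    if batch_x.shape[0] == 1 and batch_x.shape[1] == 1:
        pad = np.full((1, 1) + batch_x.shape[2:], pad_value, dtype=batch_x.dtype)
        batch_x = np.concatenate([batch_x, pad], axis=1)
    return batch_x

x = np.ones((1, 1, 300))                # one sequence with a single 300-d token
print(extend_lonely_sequence(x).shape)  # (1, 2, 300)
```

If that is what happens, the artificial extra position could plausibly come back labelled with the padding tag, which would match the extra <PAD> reported above.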
Yes, indeed. Some other questions:
In `preds = self.model.predict_on_batch(generator_output[0])` we have the embedding of the text (300 elements), the characters and the length. But then the output is:
8 elements -> 8 labels as before; somehow the first one is the <PAD>.
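A sketch of how such an 8-way output could be mapped back to tags (the label list below is made up; the only point taken from this thread is that the padding label occupies the first position):

```python
import numpy as np

# Made-up 8-entry label list; the discussion suggests the first one is <PAD>.
labels = ["<PAD>", "O", "B-figure", "I-figure", "B-label",
          "I-label", "B-caption", "I-caption"]

# Stand-in for the (batch, seq_len, n_labels) scores from predict_on_batch.
preds = np.random.rand(1, 4, 8)

pred_ids = preds.argmax(axis=-1)
pred_tags = [[labels[i] for i in seq] for seq in pred_ids]
print(pred_tags)  # any position whose argmax is 0 comes out as <PAD>
```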
I found out that, for example in the figure model, when training, the following tokens are transformed into a zero array of embeddings (probably because the embeddings do not contain such tokens); here are some examples:
so here we will get `batch_x` as a zero array. Could this be the problem? I added these to line 139 of …
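A sketch of the zero-vector fallback described here (toy embedding table; glove-840B is only mentioned because it appears in the config above, and the token used is just an example from this thread):

```python
import numpy as np

EMBEDDING_SIZE = 300

# Toy embedding table standing in for glove-840B.
embeddings = {"figure": np.random.rand(EMBEDDING_SIZE)}

def lookup(token):
    # A token absent from the embeddings falls back to an all-zero vector,
    # which is the behaviour observed above for some training tokens.
    return embeddings.get(token.lower(), np.zeros(EMBEDDING_SIZE))

print(np.count_nonzero(lookup("figure")))       # 300
print(np.count_nonzero(lookup("ANISOTROPIC")))  # 0 -> zero array in batch_x
```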
In a sequence labelling scenario, internal <PAD> tags can be present in the output of the Tagger.tag() method.
As they are internal, they should probably be filtered out.
I would be more than happy to provide a fix in a PR if you tell me where it would be best to fix it:
In WordPreprocessor.inverse_transform()?
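For what it's worth, a sketch of the kind of filtering being proposed (hypothetical code, not an actual patch against WordPreprocessor.inverse_transform()):

```python
# Hypothetical sketch of the proposed filtering, not an actual delft patch.
PAD_TAG = "<PAD>"

def strip_padding(tags):
    """Remove the internal padding label from a predicted tag sequence."""
    return [t for t in tags if t != PAD_TAG]

print(strip_padding(["O", "B-figure", "O", PAD_TAG, PAD_TAG]))
# ['O', 'B-figure', 'O']
```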