TokenClassificationPipeline: ignoring subwords #10763

Closed
2 of 4 tasks
francescorubbo opened this issue Mar 17, 2021 · 3 comments

@francescorubbo
Contributor

Environment info

  • transformers version: 4.4.1
  • Platform: Linux-4.15.0-136-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.8.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Library:

Information

Model I am using (Bert, XLNet ...):
Any NER model, e.g. elastic/distilbert-base-cased-finetuned-conll03-english

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

Ignoring subwords using the TokenClassificationPipeline.

To reproduce

Steps to reproduce the behavior:

import transformers
model_name = "elastic/distilbert-base-cased-finetuned-conll03-english"
pl = transformers.pipeline('ner', model=model_name, tokenizer=model_name, ignore_labels=[], ignore_subwords=True)
output = pl("Sir Testy McTest is testiful")
print(output)

This outputs:

[{'word': 'Sir', 'score': 0.997665524482727, 'entity': 'O', 'index': 1, 'start': 0, 'end': 3},
 {'word': 'Test', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 8},
 {'word': '##y', 'score': 0.9581826329231262, 'entity': 'B-PER', 'index': 3, 'start': 8, 'end': 9},
 {'word': 'M', 'score': 0.9105736613273621, 'entity': 'I-PER', 'index': 4, 'start': 10, 'end': 11},
 {'word': '##c', 'score': 0.9090507626533508, 'entity': 'I-PER', 'index': 5, 'start': 11, 'end': 12},
 {'word': '##T', 'score': 0.9545289874076843, 'entity': 'I-PER', 'index': 6, 'start': 12, 'end': 13},
 {'word': '##est', 'score': 0.9441993832588196, 'entity': 'I-PER', 'index': 7, 'start': 13, 'end': 16},
 {'word': 'is', 'score': 0.9999386072158813, 'entity': 'O', 'index': 8, 'start': 17, 'end': 19},
 {'word': 'test', 'score': 0.9998794198036194, 'entity': 'O', 'index': 9, 'start': 20, 'end': 24},
 {'word': '##iful', 'score': 0.9999022483825684, 'entity': 'O', 'index': 10, 'start': 24, 'end': 28}]

Expected behavior

The expected behavior would be for subword tokens to be merged with the preceding token and their predictions ignored, e.g.

{'word': 'Testy', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 9}

instead of

{'word': 'Test', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 8}, {'word': '##y', 'score': 0.9581826329231262, 'entity': 'B-PER', 'index': 3, 'start': 8, 'end': 9}
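
As a minimal sketch of the merge described above (not the pipeline's own implementation), the following post-processes the token-level output. The helper name `merge_subwords` and its `prefix` parameter are hypothetical, and it assumes WordPiece-style continuation tokens marked with `##`. Each merged word keeps the score, entity, and index of its first token and extends `end` to cover the subwords.

```python
from typing import Dict, List


def merge_subwords(tokens: List[Dict], prefix: str = "##") -> List[Dict]:
    """Fold continuation tokens into the preceding token (hypothetical helper).

    The prediction of the first token of each word is kept; the predictions
    of the continuation tokens are ignored.
    """
    merged: List[Dict] = []
    for token in tokens:
        if token["word"].startswith(prefix) and merged:
            previous = merged[-1]
            # Append the subword text without its prefix and extend the span.
            previous["word"] += token["word"][len(prefix):]
            previous["end"] = token["end"]
        else:
            # Copy so the caller's dictionaries are not mutated.
            merged.append(dict(token))
    return merged


tokens = [
    {"word": "Test", "score": 0.7986497282981873, "entity": "B-PER", "index": 2, "start": 4, "end": 8},
    {"word": "##y", "score": 0.9581826329231262, "entity": "B-PER", "index": 3, "start": 8, "end": 9},
]
print(merge_subwords(tokens))
# [{'word': 'Testy', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 9}]
```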

In the current logic, the ignore_subwords flag seems to be used only in combination with grouped_entities (https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/token_classification.py#L216). With both flags set to True, the example input above produces:

[{'entity_group': 'O', 'score': 0.997665524482727, 'word': 'Sir', 'start': 0, 'end': 3},
 {'entity_group': 'PER', 'score': 0.8546116948127747, 'word': 'Testy McTest', 'start': 4, 'end': 16},
 {'entity_group': 'O', 'score': 0.9999090135097504, 'word': 'is testiful', 'start': 17, 'end': 28}]

whereas setting grouped_entities=True and ignore_subwords=False outputs:

[{'entity_group': 'O', 'score': 0.997665524482727, 'word': 'Sir', 'start': 0, 'end': 3},
 {'entity_group': 'PER', 'score': 0.7986497282981873, 'word': 'Test', 'start': 4, 'end': 8},
 {'entity_group': 'PER', 'score': 0.9353070855140686, 'word': '##y McTest', 'start': 8, 'end': 16},
 {'entity_group': 'O', 'score': 0.9999067584673563, 'word': 'is testiful', 'start': 17, 'end': 28}]

This seems counterintuitive: grouped entities shouldn't be fragmented by subwords, and ignoring subwords shouldn't be conditional on grouping entities.
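
To illustrate grouping that is decoupled from subword handling, here is a rough sketch that groups word-level predictions (subwords already merged) into entity spans. The helper `group_words` is hypothetical and is not the pipeline's code. It assumes that a `B-` tag always starts a new group and that a group's score is the mean of its word scores, which is consistent with the grouped outputs shown above but is not confirmed against the library source.

```python
from typing import Dict, List


def group_words(words: List[Dict]) -> List[Dict]:
    """Group consecutive word-level predictions by entity type (hypothetical helper)."""
    groups: List[Dict] = []
    for word in words:
        # "B-PER" -> tag "B", label "PER"; "O" -> tag "", label "O".
        tag, _, label = word["entity"].rpartition("-")
        if groups and groups[-1]["entity_group"] == label and tag != "B":
            groups[-1]["members"].append(word)
        else:
            # Assumption: a B- tag or a change of label opens a new group.
            groups.append({"entity_group": label, "members": [word]})
    return [
        {
            "entity_group": group["entity_group"],
            # Assumption: the group score is the mean of the word scores.
            "score": sum(m["score"] for m in group["members"]) / len(group["members"]),
            "word": " ".join(m["word"] for m in group["members"]),
            "start": group["members"][0]["start"],
            "end": group["members"][-1]["end"],
        }
        for group in groups
    ]


words = [
    {"word": "Sir", "score": 0.997665524482727, "entity": "O", "start": 0, "end": 3},
    {"word": "Testy", "score": 0.7986497282981873, "entity": "B-PER", "start": 4, "end": 9},
    {"word": "McTest", "score": 0.9105736613273621, "entity": "I-PER", "start": 10, "end": 16},
    {"word": "is", "score": 0.9999386072158813, "entity": "O", "start": 17, "end": 19},
    {"word": "testiful", "score": 0.9998794198036194, "entity": "O", "start": 20, "end": 28},
]
for group in group_words(words):
    print(group["entity_group"], repr(group["word"]), group["start"], group["end"])
```

Because the function only sees whole words, the person name stays in one span regardless of how the tokenizer split it.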

@LysandreJik
Member

Hello! Could you take a look at #10568 and let me know if it's interesting for you? It proposes a refactor of the two keywords you mentioned.

@francescorubbo
Contributor Author

> Hello! Could you take a look at #10568 and let me know if it's interesting for you? It proposes a refactor of the two keywords you mentioned.

Yes! That would solve this issue. Thanks for the pointer. I'll post comments there.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
