TokenClassificationPipeline: ignoring subwords #10763

francescorubbo · 2021-03-17T05:50:21Z

Environment info

transformers version: 4.4.1
Platform: Linux-4.15.0-136-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
PyTorch version (GPU?): 1.8.0 (False)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: No
Using distributed or parallel set-up in script?: No

Who can help

Library:

pipelines: @LysandreJik

Information

Model I am using (Bert, XLNet ...):
Any NER model, e.g. elastic/distilbert-base-cased-finetuned-conll03-english

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

Ignoring subwords using the TokenClassificationPipeline.

To reproduce

Steps to reproduce the behavior:

import transformers

# Same checkpoint for model and tokenizer
model_name = "elastic/distilbert-base-cased-finetuned-conll03-english"
pl = transformers.pipeline('ner', model=model_name, tokenizer=model_name, ignore_labels=[], ignore_subwords=True)
output = pl("Sir Testy McTest is testiful")
print(output)

This outputs:

[{'word': 'Sir', 'score': 0.997665524482727, 'entity': 'O', 'index': 1, 'start': 0, 'end': 3},
 {'word': 'Test', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 8},
 {'word': '##y', 'score': 0.9581826329231262, 'entity': 'B-PER', 'index': 3, 'start': 8, 'end': 9},
 {'word': 'M', 'score': 0.9105736613273621, 'entity': 'I-PER', 'index': 4, 'start': 10, 'end': 11},
 {'word': '##c', 'score': 0.9090507626533508, 'entity': 'I-PER', 'index': 5, 'start': 11, 'end': 12},
 {'word': '##T', 'score': 0.9545289874076843, 'entity': 'I-PER', 'index': 6, 'start': 12, 'end': 13},
 {'word': '##est', 'score': 0.9441993832588196, 'entity': 'I-PER', 'index': 7, 'start': 13, 'end': 16},
 {'word': 'is', 'score': 0.9999386072158813, 'entity': 'O', 'index': 8, 'start': 17, 'end': 19},
 {'word': 'test', 'score': 0.9998794198036194, 'entity': 'O', 'index': 9, 'start': 20, 'end': 24},
 {'word': '##iful', 'score': 0.9999022483825684, 'entity': 'O', 'index': 10, 'start': 24, 'end': 28}]

Expected behavior

I would expect each subword token to be merged with the preceding token and its own prediction to be ignored, e.g.

{'word': 'Testy', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 9}

instead of

{'word': 'Test', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 8}, {'word': '##y', 'score': 0.9581826329231262, 'entity': 'B-PER', 'index': 3, 'start': 8, 'end': 9}
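A minimal sketch of the merge described above, written against plain dicts shaped like the pipeline output. The function name merge_subwords and the hard-coded "##" continuation prefix are assumptions for illustration; this is not the pipeline's own implementation.

```python
def merge_subwords(tokens, prefix="##"):
    """Fold WordPiece continuation tokens into the preceding token.

    The merged entry keeps the score, entity and index of the first
    sub-token; predictions on continuation tokens are dropped.
    """
    merged = []
    for tok in tokens:
        if tok["word"].startswith(prefix) and merged:
            prev = merged[-1]
            prev["word"] += tok["word"][len(prefix):]  # append without the "##"
            prev["end"] = tok["end"]                   # extend the character span
        else:
            merged.append(dict(tok))  # copy so the input list is not mutated
    return merged


tokens = [
    {'word': 'Test', 'score': 0.7986497282981873, 'entity': 'B-PER', 'index': 2, 'start': 4, 'end': 8},
    {'word': '##y', 'score': 0.9581826329231262, 'entity': 'B-PER', 'index': 3, 'start': 8, 'end': 9},
]
print(merge_subwords(tokens))
```

On these two entries the sketch yields a single 'Testy' entry spanning characters 4 to 9 with the score and label of 'Test'.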

In the current logic, the ignore_subwords flag seems to be used only in combination with grouped_entities (https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/token_classification.py#L216). Setting both flags to True on the example input above gives:

[{'entity_group': 'O', 'score': 0.997665524482727, 'word': 'Sir', 'start': 0, 'end': 3}, {'entity_group': 'PER', 'score': 0.8546116948127747, 'word': 'Testy McTest', 'start': 4, 'end': 16}, {'entity_group': 'O', 'score': 0.9999090135097504, 'word': 'is testiful', 'start': 17, 'end': 28}]

while setting grouped_entities=True and ignore_subwords=False outputs:

[{'entity_group': 'O', 'score': 0.997665524482727, 'word': 'Sir', 'start': 0, 'end': 3}, {'entity_group': 'PER', 'score': 0.7986497282981873, 'word': 'Test', 'start': 4, 'end': 8}, {'entity_group': 'PER', 'score': 0.9353070855140686, 'word': '##y McTest', 'start': 8, 'end': 16}, {'entity_group': 'O', 'score': 0.9999067584673563, 'word': 'is testiful', 'start': 17, 'end': 28}]

This seems counterintuitive: grouped entities shouldn't be fragmented by subwords, and ignoring subwords shouldn't be conditional on grouping entities.
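To make the separation concrete, here is a rough sketch of entity grouping as a step of its own. It assumes it receives word-level entries in which subwords have already been merged and each word carries the prediction of its first sub-token. The name group_words, joining words with a single space, and averaging the scores are all assumptions for illustration, not the library's code.

```python
def group_words(words):
    """Group adjacent word-level entries that share an entity type."""
    groups = []
    for w in words:
        tag = w["entity"]
        label = tag.split("-", 1)[-1]  # "B-PER" -> "PER", "O" -> "O"
        starts_new = (
            not groups
            or groups[-1]["entity_group"] != label
            or tag.startswith("B-")  # a B- tag always opens a new entity
        )
        if starts_new:
            groups.append({"entity_group": label, "scores": [w["score"]],
                           "word": w["word"], "start": w["start"], "end": w["end"]})
        else:
            g = groups[-1]
            g["scores"].append(w["score"])
            g["word"] += " " + w["word"]  # assumes words are space-separated
            g["end"] = w["end"]
    return [
        {"entity_group": g["entity_group"],
         "score": sum(g["scores"]) / len(g["scores"]),  # mean over the words
         "word": g["word"], "start": g["start"], "end": g["end"]}
        for g in groups
    ]


# Word-level entries for the example sentence (first sub-token predictions).
words = [
    {'word': 'Sir', 'score': 0.997665524482727, 'entity': 'O', 'start': 0, 'end': 3},
    {'word': 'Testy', 'score': 0.7986497282981873, 'entity': 'B-PER', 'start': 4, 'end': 9},
    {'word': 'McTest', 'score': 0.9105736613273621, 'entity': 'I-PER', 'start': 10, 'end': 16},
    {'word': 'is', 'score': 0.9999386072158813, 'entity': 'O', 'start': 17, 'end': 19},
    {'word': 'testiful', 'score': 0.9999022483825684, 'entity': 'O', 'start': 20, 'end': 28},
]
for group in group_words(words):
    print(group)
```

With this input the sketch produces three groups ('Sir', 'Testy McTest', 'is testiful'), and the grouping step never needs to know about subwords.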

LysandreJik · 2021-03-17T15:15:04Z

Hello! Could you take a look at #10568 and let me know if it's interesting for you? It proposes a refactor of the two keywords you mentioned.

francescorubbo · 2021-03-18T03:44:19Z

> Hello! Could you take a look at #10568 and let me know if it's interesting for you? It proposes a refactor of the two keywords you mentioned.

Yes! That would solve this issue. Thanks for the pointer. I'll post comments there.

github-actions · 2021-04-16T15:02:00Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Apr 27, 2021

francescorubbo mentioned this issue May 7, 2021

[TokenClassification] Label realignment for subword aggregation #11622

Closed

5 tasks

Narsil mentioned this issue May 11, 2021

[TokenClassification] Label realignment for subword aggregation #11680

Merged

5 tasks
