
Aggregation strategy does not respect original utterance #18738

Closed
ierezell opened this issue Aug 23, 2022 · 3 comments · Fixed by #18763
@ierezell
Contributor

ierezell commented Aug 23, 2022

System Info

  • transformers version: 4.21.1
  • Platform: Linux-5.19.2-arch1-1-x86_64-with-glibc2.36
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.11.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@Narsil
@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce :

  1. Finetune a model for token classification.
  2. Get the results as follows with AggregationStrategy.SIMPLE:

Input : "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."

[
    {"entity_group": "lbl1", "score": 0.0, "word": "Hi",  "start": 0, "end": 2},
    {"entity_group": "lbl2", "score": 0.0, "word": ",", "start": 2, "end": 3},
    {"entity_group": "lbl3", "score": 0.0, "word": "can you book me flight ticket?", "start": 4, "end": 35},
    {"entity_group": "lbl2", "score": 0.0, "word": "Also", "start": 36, "end": 40},
    {"entity_group": "lbl3", "score": 0.0, "word": ", will there be some snacks available on the plane?", "start": 40, "end": 92},
    {"entity_group": "lbl1", "score": 0.0, "word": "Many thanks.", "start": 93, "end": 105}
]

Note that the start and end offsets of the aggregated output are correct! But the "word" is not.

  3. Do the same thing without aggregation and get:
[
//...
{'entity': 'lbl3', 'score': 0.0, 'index': 3, 'word': 'can', 'start': 4, 'end': 7}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 4, 'word': 'you', 'start': 8, 'end': 11}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 5, 'word': 'book', 'start': 12, 'end': 16},
{'entity': 'lbl3', 'score': 0.0, 'index': 6, 'word': 'me', 'start': 17, 'end': 19}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 7, 'word': 'flight', 'start': 20, 'end': 26}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 8, 'word': 'ticket', 'start': 27, 'end': 33}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 9, 'word': '?', 'start': 34, 'end': 35}, 
//...
]

Without aggregation, the span is reconstructed correctly from the offsets: input[27:33] = "ticket" + input[33:34] = " " + input[34:35] = "?" gives "ticket ?".
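The offset arithmetic above can be checked directly in Python, using the input sentence from the reproduction (no model needed, since the offsets reported by the pipeline are the part that is already correct):

```python
text = "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."

# Per-token offsets from the non-aggregated output: slicing the original
# string with them reconstructs the exact surface form, spaces included.
assert text[27:33] == "ticket"
assert text[33:34] == " "
assert text[34:35] == "?"
assert text[27:35] == "ticket ?"

# The aggregated output's offsets are also correct, even though its "word"
# ("can you book me flight ticket?") is not: the group with start=4, end=35
# covers the original text including the space before the "?".
assert text[4:35] == "can you book me flight ticket ?"
```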

Expected behavior

The returned "word" should match the original utterance exactly, so that we can use the indexes and text spans in the rest of our code.

Thanks in advance,
Have a great day :)

@ierezell ierezell added the bug label Aug 23, 2022
@Narsil
Contributor

Narsil commented Aug 25, 2022

Hi @ierezell,

What you're asking for does make sense, but it's probably not going to happen since it would be a backward incompatibility.
Pinging core maintainers to get advice on this @sgugger @LysandreJik.

"word" was an output since the creation of the pipeline and corresponds to the tokens being decoded: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py#L409

This means that during decoding anything can happen to the string and there is no reason for it to be like the original string.

What we could do is use the offsets instead to look up what was actually in the original string, but that would break existing pipelines relying on this code to actually use decode.

  • The non-breaking solution would be to add a new key "original_word" (or a better name) that does the indexing for you. But that means two different strings in the returned results, and it's not obvious which one is better (they're going to be the same most of the time).

  • Do nothing (current solution).

  • Break backward compatibility; that's probably for v5.

There are a lot of quirks in the pipelines like this which are not huge changes, but that we still can't do because it would mean breaking user code.
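The non-breaking option (a new key derived from the offsets) can already be emulated today in user code as a post-processing step, without touching the pipeline. A minimal sketch, where `original_word` is a hypothetical key name and the entity dict mimics the aggregated output shown in the issue:

```python
def add_original_word(text, entities):
    """Attach the exact span from the original string to each entity,
    using the (already correct) start/end offsets."""
    return [
        {**entity, "original_word": text[entity["start"]:entity["end"]]}
        for entity in entities
    ]

text = "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."
entities = [
    {"entity_group": "lbl3", "score": 0.0,
     "word": "can you book me flight ticket?", "start": 4, "end": 35},
]

enriched = add_original_word(text, entities)
print(enriched[0]["original_word"])  # "can you book me flight ticket ?" (space before "?" preserved)
```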

@ierezell
Contributor Author

ierezell commented Aug 25, 2022

Hello @Narsil,

I understand the underlying problem; I checked the code before opening the issue and saw the "tokenizer problem".

It's not a "huge" problem, as anyone can still reconstruct the "original" spans with the starts and stops, since they're correct.
However, a little more documentation on this behavior could help users avoid these misconceptions about the returned data.

For the solutions :

  1. Would be best for my use case, but for 90% of other users it's too much trouble for the team to maintain, and option 2 is enough, since start and stop, used correctly, give me what I need.
  2. In my opinion the best: as said above, document that users should use start and stop to get spans, instead of innocently relying on "word", if their application needs to process the original utterance.
  3. Would totally solve the issue, but for sure let's keep that for later as it's breaking.

I totally understand that you need to keep a stable API, and I understand the underlying problem. I just wanted to flag that innocent people like me can misuse the returned data.

Thanks for the information and help :)
You can close the issue if you want, as 2. is enough and you're aware of the use case.

Have a great day

@Narsil
Contributor

Narsil commented Aug 25, 2022

Thanks for understanding, I'll let core maintainers conclude on this.

For the documentation, would this be better/enough?
#18763
