
Aggregation strategy does not respect original utterance #18738

Closed
ierezell opened this issue Aug 23, 2022 · 3 comments · Fixed by #18763
@ierezell
Contributor

ierezell commented Aug 23, 2022

System Info

  • transformers version: 4.21.1
  • Platform: Linux-5.19.2-arch1-1-x86_64-with-glibc2.36
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.11.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@Narsil
@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce :

  1. Finetune a model for token classification.
  2. Get the results as follows with AggregationStrategy.SIMPLE:

Input : "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."

[
    {"entity_group": "lbl1", "score": 0.0, "word": "Hi",  "start": 0, "end": 2},
    {"entity_group": "lbl2", "score": 0.0, "word": ",", "start": 2, "end": 3},
    {"entity_group": "lbl3", "score": 0.0, "word": "can you book me flight ticket?", "start": 4, "end": 35},
    {"entity_group": "lbl2", "score": 0.0, "word": "Also", "start": 36, "end": 40},
    {"entity_group": "lbl3", "score": 0.0, "word": ", will there be some snacks available on the plane?", "start": 40, "end": 92},
    {"entity_group": "lbl1", "score": 0.0, "word": "Many thanks.", "start": 93, "end": 105}
]

Note that the start and end offsets of the aggregated output are correct! But the "word" is not.

  3. Do the same thing without aggregation and get:
[
//...
{'entity': 'lbl3', 'score': 0.0, 'index': 3, 'word': 'can', 'start': 4, 'end': 7}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 4, 'word': 'you', 'start': 8, 'end': 11}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 5, 'word': 'book', 'start': 12, 'end': 16},
{'entity': 'lbl3', 'score': 0.0, 'index': 6, 'word': 'me', 'start': 17, 'end': 19}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 7, 'word': 'flight', 'start': 20, 'end': 26}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 8, 'word': 'ticket', 'start': 27, 'end': 33}, 
{'entity': 'lbl3', 'score': 0.0, 'index': 9, 'word': '?', 'start': 34, 'end': 35}, 
//...
]

Without aggregation, the span is reconstructed correctly from the offsets: input[27:33] = "ticket" + input[33:34] = " " + input[34:35] = "?" gives "ticket ?".
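The offset arithmetic above can be checked directly in Python, using the input sentence from the reproduction (no model needed, since the offsets reported by the pipeline are the part that is already correct):

```python
text = "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."

# Per-token offsets from the non-aggregated output: slicing the original
# string with them reconstructs the exact surface form, spaces included.
assert text[27:33] == "ticket"
assert text[33:34] == " "
assert text[34:35] == "?"
assert text[27:35] == "ticket ?"

# The aggregated output's offsets are also correct, even though its "word"
# ("can you book me flight ticket?") is not: the group with start=4, end=35
# covers the original text including the space before the "?".
assert text[4:35] == "can you book me flight ticket ?"
```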

Expected behavior

The returned "word" should match the original utterance exactly, so that we can use the indexes and text spans in the rest of our code.

Thanks in advance,
Have a great day :)

@ierezell ierezell added the bug label Aug 23, 2022
@Narsil
Contributor

Narsil commented Aug 25, 2022

Hi @ierezell,

What you're asking for does make sense, but it's probably not going to happen since it would be a backward incompatibility.
Pinging core maintainers to get advice on this @sgugger @LysandreJik.

"word" was an output since the creation of the pipeline and corresponds to the tokens being decoded: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py#L409

This means that during decoding anything can happen to the string and there is no reason for it to be like the original string.

What we could do is use the offsets instead to look up what was actually in the original string, but that would break existing pipelines relying on this code to actually use decode.

  • The non-breaking solution would be to add a new key "original_word" (or a better name) that does the indexing for you. But that means two different strings in the returned results, and it's not obvious which one is better (they're going to be the same most of the time).

  • Do nothing (current solution).

  • Break backward compatibility; that's probably for v5.

There are a lot of quirks in the pipelines like this which are not huge changes, but that we still can't do because it would mean breaking user code.
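The non-breaking option (a new key derived from the offsets) can already be emulated today in user code as a post-processing step, without touching the pipeline. A minimal sketch, where `original_word` is a hypothetical key name and the entity dict mimics the aggregated output shown in the issue:

```python
def add_original_word(text, entities):
    """Attach the exact span from the original string to each entity,
    using the (already correct) start/end offsets."""
    return [
        {**entity, "original_word": text[entity["start"]:entity["end"]]}
        for entity in entities
    ]

text = "Hi, can you book me flight ticket ? Also, will there be some snacks available on the plane ? Many thanks."
entities = [
    {"entity_group": "lbl3", "score": 0.0,
     "word": "can you book me flight ticket?", "start": 4, "end": 35},
]

enriched = add_original_word(text, entities)
print(enriched[0]["original_word"])  # "can you book me flight ticket ?" (space before "?" preserved)
```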

@ierezell
Contributor Author

ierezell commented Aug 25, 2022

Hello @Narsil,

I understand the underlying problem; I checked the code before opening the issue and saw the "tokenizer problem".

It's not a "huge" problem, as anyone can still reconstruct the "original" spans with the starts and stops, since they're correct.
However, a little more documentation on this behavior could help users avoid these misconceptions about the returned data.

For the solutions :

  1. Would be best for my use case, but for 90% of other users it's too much trouble for the team to maintain, and option 2 is enough, since start and stop, used correctly, give me what I need.
  2. In my opinion the best: as said above, document that users should use start and stop to get spans, instead of innocently relying on "word", if their application needs to process the original utterance.
  3. Would totally solve the issue, but for sure let's keep that for later as it's breaking.

I totally understand that you need to keep a stable API, and I understand the underlying problem. I just wanted to flag that innocent people like me can misuse the returned data.

Thanks for the information and help :)
You can close the issue if you want, as 2. is enough and you're aware of the use case.

Have a great day

@Narsil
Contributor

Narsil commented Aug 25, 2022

Thanks for understanding, I'll let core maintainers conclude on this.

For the documentation, would this be better/enough?
#18763
