
LayoutXLMProcessor: Enforce using "return_overflowing_tokens" with "return_offsets_mapping" #18774

Conversation

anthony2261 (Contributor)

What does this PR do?

Fixes #18726

Code to reproduce the error: https://colab.research.google.com/drive/1ETpz8UP42r7HjRg4VUkC7L8ou10qY3bQ?usp=sharing

The specific combination that caused the error: LayoutXLMProcessor with `use_fast=False`, `return_overflowing_tokens=True`, and `return_offsets_mapping=False`. A minimal sketch of that failing call is shown below.
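For reference, a minimal sketch of the failing call (the image path, OCR words, and boxes are illustrative placeholders, not taken from the linked notebook):

```python
from PIL import Image
from transformers import LayoutXLMProcessor

# Load the processor with the slow (Python) tokenizer; apply_ocr=False means
# we supply the words and bounding boxes ourselves.
processor = LayoutXLMProcessor.from_pretrained(
    "microsoft/layoutxlm-base", apply_ocr=False, use_fast=False
)

image = Image.open("document.png").convert("RGB")  # placeholder document image
words = ["hello", "world"]                         # placeholder OCR words
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]               # placeholder boxes (0-1000 scale)

# Before this fix, the slow tokenizer produced no "overflow_to_sample_mapping",
# so the processor's per-chunk image realignment raised a KeyError. With the
# enforcement this PR adds, the same call should instead fail fast with an
# explicit error about the missing offsets mapping.
encoding = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=False,
)
```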

Who can review?

@NielsRogge

HuggingFaceDocBuilderDev commented Aug 26, 2022

The documentation is not available anymore as the PR was closed or merged.

NielsRogge (Contributor)

Thanks! However, it doesn't include the same changes as #17092; could you add those?

anthony2261 (Contributor, Author)

> Thanks! However, it doesn't include the same changes as #17092; could you add those?

The additional changes in the other PR were already present; I added the missing ones. See the screenshots below.

This PR:
[screenshot of this PR's diff]

The other PR:
[screenshot of the other PR's diff]

NielsRogge (Contributor) commented Aug 26, 2022

Oh yeah, just realized I added that myself when adding LayoutLMv3. Thanks! I'll add another reviewer before merging.

NielsRogge requested a review from sgugger on August 26, 2022, at 12:46.
anthony2261 (Contributor, Author) commented Aug 26, 2022

By the way, one issue I have with this: if we use the processor with use_fast=False, we won't be able to process data with return_overflowing_tokens=True at all, because the new check forces us to set return_offsets_mapping=True, and that in turn raises this error:

NotImplementedError: return_offset_mapping is not available when using Python tokenizers.
To use this feature, change your tokenizer to one deriving from transformers.PreTrainedTokenizerFast.
More information on available tokenizers at https://github.com/huggingface/transformers/pull/2674

I was able to get 1-to-1 token mappings with the non-fast tokenizer but in a very hacky way. If I find a better way, I'll suggest doing that to support return_overflowing_tokens with regular tokenizers.

N.B.: The same applies to LayoutLMv2; I'm not sure about LayoutLMv3.
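For context on this comment, here is a minimal, self-contained sketch (not the actual transformers source; the function name and error wordings are illustrative) of how the enforcement this PR adds interacts with the slow-tokenizer limitation described above:

```python
# Hypothetical condensation of the two checks involved, to show why the
# combination becomes impossible with a slow tokenizer.

def check_overflow_args(return_overflowing_tokens: bool,
                        return_offsets_mapping: bool,
                        is_fast_tokenizer: bool) -> None:
    # The guard this PR enforces: overflowing tokens require the offsets
    # mapping, because the processor relies on fast-tokenizer outputs such
    # as "overflow_to_sample_mapping" to realign images with each chunk.
    if return_overflowing_tokens and not return_offsets_mapping:
        raise ValueError(
            "You cannot return overflowing tokens without returning the offsets mapping."
        )
    # The catch raised in the comment above: slow (Python) tokenizers cannot
    # produce an offsets mapping, so combined with the guard there is no
    # valid way to request overflowing tokens with use_fast=False.
    if return_offsets_mapping and not is_fast_tokenizer:
        raise NotImplementedError(
            "return_offset_mapping is not available when using Python tokenizers."
        )

try:
    check_overflow_args(return_overflowing_tokens=True,
                        return_offsets_mapping=True,
                        is_fast_tokenizer=False)
except NotImplementedError as e:
    print(e)  # the dead end described in the comment above
```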

LysandreJik (Member) left a comment


Yes, this looks good to me! Thank you, @anthony2261!

LysandreJik merged commit a98f6a1 into huggingface:main on Aug 30, 2022.
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request on Sep 26, 2022.
Successfully merging this pull request may close these issues.

KeyError 'overflow_to_sample_mapping' when using LayoutXLM with regular Tokenizer + return_overflowing_tokens