
LayoutXLMProcessor: Enforce using "return_overflowing_tokens" with "return_offsets_mapping" #18774

Conversation

anthony2261 (Contributor)

What does this PR do?

Fixes #18726

Code to reproduce the error: https://colab.research.google.com/drive/1ETpz8UP42r7HjRg4VUkC7L8ou10qY3bQ?usp=sharing

The specific combination that caused the error: LayoutXLMProcessor with `use_fast=False`, `return_overflowing_tokens=True`, and `return_offsets_mapping=False`. A minimal sketch of that failing call is shown below.
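For reference, a minimal sketch of the failing call (the image path, OCR words, and boxes are illustrative placeholders, not taken from the linked notebook):

```python
from PIL import Image
from transformers import LayoutXLMProcessor

# Load the processor with the slow (Python) tokenizer; apply_ocr=False means
# we supply the words and bounding boxes ourselves.
processor = LayoutXLMProcessor.from_pretrained(
    "microsoft/layoutxlm-base", apply_ocr=False, use_fast=False
)

image = Image.open("document.png").convert("RGB")  # placeholder document image
words = ["hello", "world"]                         # placeholder OCR words
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]               # placeholder boxes (0-1000 scale)

# Before this fix, the slow tokenizer produced no "overflow_to_sample_mapping",
# so the processor's per-chunk image realignment raised a KeyError. With the
# enforcement this PR adds, the same call should instead fail fast with an
# explicit error about the missing offsets mapping.
encoding = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=False,
)
```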

Who can review?

@NielsRogge

HuggingFaceDocBuilderDev commented Aug 26, 2022

The documentation is not available anymore as the PR was closed or merged.

NielsRogge (Contributor)

Thanks! However, it doesn't include the same changes as #17092; could you add those?

anthony2261 (Contributor, Author)

> Thanks! However, it doesn't include the same changes as #17092; could you add those?

The additional changes in the other PR were already present; I added the missing ones. See the screenshots below.

This PR:
[screenshot of this PR's diff]

The other PR:
[screenshot of the other PR's diff]

NielsRogge (Contributor) commented Aug 26, 2022

Oh yeah, just realized I added that myself when adding LayoutLMv3. Thanks! I'll add another reviewer before merging.

NielsRogge requested a review from sgugger on August 26, 2022, at 12:46.
anthony2261 (Contributor, Author) commented Aug 26, 2022

By the way, one issue I have with this: if we use the processor with use_fast=False, we won't be able to process data with return_overflowing_tokens=True at all, because the new check forces us to set return_offsets_mapping=True, and that in turn raises this error:

NotImplementedError: return_offset_mapping is not available when using Python tokenizers.
To use this feature, change your tokenizer to one deriving from transformers.PreTrainedTokenizerFast.
More information on available tokenizers at https://github.com/huggingface/transformers/pull/2674

I was able to get 1-to-1 token mappings with the non-fast tokenizer but in a very hacky way. If I find a better way, I'll suggest doing that to support return_overflowing_tokens with regular tokenizers.

N.B.: The same applies to LayoutLMv2; I'm not sure about LayoutLMv3.
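For context on this comment, here is a minimal, self-contained sketch (not the actual transformers source; the function name and error wordings are illustrative) of how the enforcement this PR adds interacts with the slow-tokenizer limitation described above:

```python
# Hypothetical condensation of the two checks involved, to show why the
# combination becomes impossible with a slow tokenizer.

def check_overflow_args(return_overflowing_tokens: bool,
                        return_offsets_mapping: bool,
                        is_fast_tokenizer: bool) -> None:
    # The guard this PR enforces: overflowing tokens require the offsets
    # mapping, because the processor relies on fast-tokenizer outputs such
    # as "overflow_to_sample_mapping" to realign images with each chunk.
    if return_overflowing_tokens and not return_offsets_mapping:
        raise ValueError(
            "You cannot return overflowing tokens without returning the offsets mapping."
        )
    # The catch raised in the comment above: slow (Python) tokenizers cannot
    # produce an offsets mapping, so combined with the guard there is no
    # valid way to request overflowing tokens with use_fast=False.
    if return_offsets_mapping and not is_fast_tokenizer:
        raise NotImplementedError(
            "return_offset_mapping is not available when using Python tokenizers."
        )

try:
    check_overflow_args(return_overflowing_tokens=True,
                        return_offsets_mapping=True,
                        is_fast_tokenizer=False)
except NotImplementedError as e:
    print(e)  # the dead end described in the comment above
```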

LysandreJik (Member) left a comment


Yes, this looks good to me! Thank you, @anthony2261!

LysandreJik merged commit a98f6a1 into huggingface:main on Aug 30, 2022.
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request on Sep 26, 2022.
Successfully merging this pull request may close these issues.

KeyError 'overflow_to_sample_mapping' when using LayoutXLM with regular Tokenizer + return_overflowing_tokens