
LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens #17092

Merged: 4 commits into huggingface:main on May 9, 2022

Conversation

@garyhlai (Contributor) commented May 5, 2022

What does this PR do?

Fixes #13554

Problem re-summarized: when return_overflowing_tokens is set to True, LayoutLMv2Processor breaks sequences that are too long into multiple input_ids sequences, producing more input_ids rows than images and therefore a mismatch between the two.

This fix ensures a 1-to-1 mapping between the images and input_ids.

Reproducible example (the assertion at the end fails without the fix and passes with it):

from PIL import Image
from transformers import LayoutLMv2Processor
from datasets import load_dataset

datasets = load_dataset("nielsr/funsd")
labels = datasets['train'].features['ner_tags'].feature.names
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

def preprocess_data(examples):
    images = [Image.open(path).convert("RGB") for path in examples['image_path']]
    words = examples['words']
    boxes = examples['bboxes']
    word_labels = examples['ner_tags']
    encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                               padding="max_length", truncation=True,
                               return_overflowing_tokens=True,
                               stride=50,
                               return_offsets_mapping=True,
                               return_tensors="pt")
    return encoded_inputs

train_data = preprocess_data(datasets["train"])

# This assertion fails without the fix in this PR.
assert len(train_data["image"]) == len(train_data["input_ids"])
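The core of the fix is to duplicate each image once per overflowing chunk, using the overflow_to_sample_mapping the tokenizer already returns. A minimal standalone sketch of that logic (plain strings stand in for the actual image arrays; the real get_overflowing_images in the PR operates on the processor's encoded batch):

```python
def get_overflowing_images(images, overflow_to_sample_mapping):
    # Reuse the source sample's image for every chunk that sample produced.
    images_with_overflow = [images[sample_idx] for sample_idx in overflow_to_sample_mapping]
    if len(images_with_overflow) != len(overflow_to_sample_mapping):
        raise ValueError("images and overflow_to_sample_mapping lengths differ")
    return images_with_overflow

# Two input samples; sample 0 overflowed into two chunks, sample 1 into one.
images = ["img0", "img1"]
mapping = [0, 0, 1]
print(get_overflowing_images(images, mapping))  # ['img0', 'img0', 'img1']
```

After this duplication, len(encoded_inputs["image"]) matches len(encoded_inputs["input_ids"]) even when overflow splits a sample into several rows.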

Required Input from Reviewers

Right now, LayoutLMv2Processor returns a list for encoded_inputs["image"] regardless of the value of return_tensors. If we want it to return a torch tensor when return_tensors == "pt", we have to torch.stack the list (and do the same to support "np" and "tf").

Should I implement this in get_overflowing_images, or should I leave the return type as a list and just print a warning?
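For illustration, the stacking the question refers to could look like the sketch below. This is a hypothetical helper, not the PR's code; it uses NumPy for the "np" branch (the "pt" and "tf" branches would use torch.stack and tf.stack the same way) and assumes all images share one shape:

```python
import numpy as np

def stack_images(image_list, return_tensors=None):
    # Hypothetical post-processing: collapse the list of per-sample image
    # arrays into one batch array when a tensor type is requested.
    if return_tensors == "np":
        return np.stack(image_list)  # shape: (batch, C, H, W)
    # "pt" would use torch.stack, "tf" tf.stack; a plain list is the fallback.
    return image_list

batch = stack_images([np.zeros((3, 224, 224)) for _ in range(4)], return_tensors="np")
print(batch.shape)  # (4, 3, 224, 224)
```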

Who can review?

@NielsRogge @sgugger @LysandreJik

P.S.

The test_processor_case_1 in test_processor_layoutlmv2.py fails before this PR. I'd be happy to look at it as well but it's unrelated to this PR.

@HuggingFaceDocBuilderDev commented May 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

LGTM but let's wait for @NielsRogge to have a look too!

Comment on lines 141 to 143
assert len(images_with_overflow) == len(
overflow_to_sample_mapping
), f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
Collaborator:

No new assert in the code base, please use a test and raise a ValueError here.

@garyhlai (Author) replied May 6, 2022

Just added a test and changed the assert to a ValueError (running make style led to quite a few formatting changes in test_processor_layoutlmv2.py for some reason).

The reason for avoiding assert is that it is mostly a debugging aid for catching programmer errors, and it can get silenced in production, right?
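The silencing is concrete: running Python with the -O flag strips assert statements out of the bytecode entirely, so an assert-based check would vanish in optimized runs. The same effect can be demonstrated in-process with the optimize parameter of the built-in compile():

```python
src = "assert False, 'this check disappears under -O'"

# Normal mode (optimize=0): the failing assert raises as expected.
raised = False
try:
    exec(compile(src, "<demo>", "exec", optimize=0))
except AssertionError:
    raised = True
print("optimize=0 raised AssertionError:", raised)  # True

# -O mode (optimize=1): the assert is compiled away, so nothing raises.
exec(compile(src, "<demo>", "exec", optimize=1))
print("optimize=1: assert was stripped, no error")
```

A ValueError, by contrast, is raised unconditionally, which is why it is the right tool for validating user-facing input.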

Collaborator:

Yes, exactly!

@NielsRogge (Contributor) left a comment

LGTM, thanks for improving!

@sgugger (Collaborator) left a comment

Are you sure you have the exact same version of black as is pinned in our setup? The CI style check passes on master, so none of the reformatting unrelated to the changes in your PR should be necessary.
The easiest fix might be to revert your last commit once you have confirmed the black version, as black doesn't undo lines it has already reformatted.


if len(images_with_overflow) != len(overflow_to_sample_mapping):
    raise ValueError(
        f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
    )
Collaborator:

Can you split that message over several lines to respect the 119-character limit?

@garyhlai (Author) replied May 6, 2022

Done. Is there a reason the CI can't currently catch strings that exceed the 119-character limit? (I know black doesn't enforce a line-length limit on strings by default.)

Fixed the formatting issue too: it turns out my VSCode autoformats on save with a different version of black, and running make style afterwards with the HF-pinned version can't undo those changes.
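For reference, the usual way to keep a long message under the limit is Python's implicit concatenation of adjacent string literals, which lets the message span several source lines without changing the resulting string (a sketch with hypothetical lengths 3 and 5 in place of the real len() calls):

```python
n_images, n_mapping = 3, 5  # hypothetical mismatched lengths

# Adjacent string literals are joined at compile time, so splitting the
# message across lines does not alter its content.
message = (
    "Expected length of images to be the same as the length of "
    f"overflow_to_sample_mapping, but got {n_images} and {n_mapping}"
)
print(message)
```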

@sgugger sgugger merged commit e9fd583 into huggingface:main May 9, 2022
@sgugger (Collaborator) commented May 9, 2022

Thanks again for your contribution!

nandwalritik pushed a commit to nandwalritik/transformers that referenced this pull request May 10, 2022
… in case of overflowing tokens (huggingface#17092)

* add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor

* make style

* add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes

* change line length by passing --preview into black
@ducviet00 ducviet00 mentioned this pull request May 11, 2022
5 tasks
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022
… in case of overflowing tokens (huggingface#17092)

(same commits as above)
@timothyjlaurent
Thanks for handling this @ghlai9665

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
… in case of overflowing tokens (huggingface#17092)

(same commits as above)
Successfully merging this pull request may close these issues.

LayoutLMv2 processing doesn't handle tokenizer overflow