
LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens #17092

Merged: 4 commits into huggingface:main on May 9, 2022

Conversation

@garyhlai (Contributor) commented May 5, 2022

What does this PR do?

Fixes #13554

Problem re-summarized: when return_overflowing_tokens is set to True, LayoutLMv2Processor breaks sequences that are too long into multiple input_ids sequences, producing more input_ids rows than images and therefore a mismatch between the two.

This fix ensures a 1-to-1 mapping between the images and input_ids.

Reproducible example (the assertion at the end fails without the fix and passes with it):

from PIL import Image
from transformers import LayoutLMv2Processor
from datasets import load_dataset

datasets = load_dataset("nielsr/funsd")
labels = datasets['train'].features['ner_tags'].feature.names
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

def preprocess_data(examples):
    images = [Image.open(path).convert("RGB") for path in examples['image_path']]
    words = examples['words']
    boxes = examples['bboxes']
    word_labels = examples['ner_tags']
    encoded_inputs = processor(images, words, boxes=boxes, word_labels=word_labels,
                               padding="max_length", truncation=True,
                               return_overflowing_tokens=True,
                               stride=50,
                               return_offsets_mapping=True,
                               return_tensors="pt")
    return encoded_inputs

train_data = preprocess_data(datasets["train"])

# This assertion fails without the fix in this PR.
assert len(train_data["image"]) == len(train_data["input_ids"])
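The core of the fix is to duplicate each image once per overflowing chunk, using the overflow_to_sample_mapping the tokenizer already returns. A minimal standalone sketch of that logic (plain strings stand in for the actual image arrays; the real get_overflowing_images in the PR operates on the processor's encoded batch):

```python
def get_overflowing_images(images, overflow_to_sample_mapping):
    # Reuse the source sample's image for every chunk that sample produced.
    images_with_overflow = [images[sample_idx] for sample_idx in overflow_to_sample_mapping]
    if len(images_with_overflow) != len(overflow_to_sample_mapping):
        raise ValueError("images and overflow_to_sample_mapping lengths differ")
    return images_with_overflow

# Two input samples; sample 0 overflowed into two chunks, sample 1 into one.
images = ["img0", "img1"]
mapping = [0, 0, 1]
print(get_overflowing_images(images, mapping))  # ['img0', 'img0', 'img1']
```

After this duplication, len(encoded_inputs["image"]) matches len(encoded_inputs["input_ids"]) even when overflow splits a sample into several rows.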

Required Input from Reviewers

Right now, LayoutLMv2Processor returns a list for encoded_inputs["image"] regardless of the value of return_tensors. If we want it to return a torch tensor when return_tensors == "pt", we have to torch.stack the list (and do the same to support "np" and "tf").

Should I implement this in get_overflowing_images, or should I leave the return type as a list and just print a warning?
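For illustration, the stacking the question refers to could look like the sketch below. This is a hypothetical helper, not the PR's code; it uses NumPy for the "np" branch (the "pt" and "tf" branches would use torch.stack and tf.stack the same way) and assumes all images share one shape:

```python
import numpy as np

def stack_images(image_list, return_tensors=None):
    # Hypothetical post-processing: collapse the list of per-sample image
    # arrays into one batch array when a tensor type is requested.
    if return_tensors == "np":
        return np.stack(image_list)  # shape: (batch, C, H, W)
    # "pt" would use torch.stack, "tf" tf.stack; a plain list is the fallback.
    return image_list

batch = stack_images([np.zeros((3, 224, 224)) for _ in range(4)], return_tensors="np")
print(batch.shape)  # (4, 3, 224, 224)
```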

Who can review?

@NielsRogge @sgugger @LysandreJik

P.S.

The test_processor_case_1 in test_processor_layoutlmv2.py fails before this PR. I'd be happy to look at it as well but it's unrelated to this PR.

@HuggingFaceDocBuilderDev commented May 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@sgugger (Collaborator) left a comment

LGTM but let's wait for @NielsRogge to have a look too!

Comment on lines 141 to 143
assert len(images_with_overflow) == len(
overflow_to_sample_mapping
), f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
Collaborator:

No new assert in the code base, please use a test and raise a ValueError here.

@garyhlai (Author) replied May 6, 2022

Just added a test and changed the assert to a ValueError (running make style led to quite a few formatting changes in test_processor_layoutlmv2.py for some reason).

The reason for avoiding assert is that it is mostly a debugging aid for catching programmer errors, and it can get silenced in production, right?
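The silencing is concrete: running Python with the -O flag strips assert statements out of the bytecode entirely, so an assert-based check would vanish in optimized runs. The same effect can be demonstrated in-process with the optimize parameter of the built-in compile():

```python
src = "assert False, 'this check disappears under -O'"

# Normal mode (optimize=0): the failing assert raises as expected.
raised = False
try:
    exec(compile(src, "<demo>", "exec", optimize=0))
except AssertionError:
    raised = True
print("optimize=0 raised AssertionError:", raised)  # True

# -O mode (optimize=1): the assert is compiled away, so nothing raises.
exec(compile(src, "<demo>", "exec", optimize=1))
print("optimize=1: assert was stripped, no error")
```

A ValueError, by contrast, is raised unconditionally, which is why it is the right tool for validating user-facing input.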

Collaborator:

Yes, exactly!

@NielsRogge (Contributor) left a comment

LGTM, thanks for improving!

@sgugger (Collaborator) left a comment

Are you sure you have the exact same version of black as is pinned in our setup? The CI style check passes on master, so none of the reformatting unrelated to the changes in your PR should be necessary.
The easiest fix might be to revert your last commit once you have confirmed the black version, as black doesn't undo lines it has already reformatted.


if len(images_with_overflow) != len(overflow_to_sample_mapping):
    raise ValueError(
        f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
    )
Collaborator:

Can you split that message over several lines to respect the 119-character limit?

@garyhlai (Author) replied May 6, 2022

Done. Is there a reason the CI can't currently catch strings that exceed the 119-character limit? (I know black doesn't enforce a line-length limit on strings by default.)

Fixed the formatting issue too: it turns out my VSCode autoformats on save with a different version of black, and running make style afterwards with the HF-pinned version can't undo those changes.
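For reference, the usual way to keep a long message under the limit is Python's implicit concatenation of adjacent string literals, which lets the message span several source lines without changing the resulting string (a sketch with hypothetical lengths 3 and 5 in place of the real len() calls):

```python
n_images, n_mapping = 3, 5  # hypothetical mismatched lengths

# Adjacent string literals are joined at compile time, so splitting the
# message across lines does not alter its content.
message = (
    "Expected length of images to be the same as the length of "
    f"overflow_to_sample_mapping, but got {n_images} and {n_mapping}"
)
print(message)
```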

@sgugger sgugger merged commit e9fd583 into huggingface:main May 9, 2022
@sgugger (Collaborator) commented May 9, 2022

Thanks again for your contribution!

nandwalritik pushed a commit to nandwalritik/transformers that referenced this pull request May 10, 2022
… in case of overflowing tokens (huggingface#17092)

* add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor

* make style

* add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes

* change line length by passing --preview into black
@ducviet00 ducviet00 mentioned this pull request May 11, 2022
5 tasks
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022
… in case of overflowing tokens (huggingface#17092)

(same commits as above)
@timothyjlaurent
Thanks for handling this @ghlai9665

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
… in case of overflowing tokens (huggingface#17092)

(same commits as above)
Successfully merging this pull request may close these issues.

LayoutLMv2 processing doesn't handle tokenizer overflow