LayoutLMv2Processor: ensure 1-to-1 mapping between images and samples in case of overflowing tokens #17092
Conversation
…samples and images in LayoutLMv2Processor
The documentation is not available anymore as the PR was closed or merged.
LGTM but let's wait for @NielsRogge to have a look too!
```python
assert len(images_with_overflow) == len(
    overflow_to_sample_mapping
), f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
```
No new `assert` in the code base, please use a test and raise a `ValueError` here.
Just added a test and changed the `assert` to a `ValueError` (running `make style` led to quite a few formatting changes in `test_processor_layoutlmv2.py` for some reason).

The reason for avoiding `assert` is that it is mostly meant for debugging (detecting programmer errors) and can be disabled in production, right?
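For context, a minimal illustration of the difference (not part of the PR diff; the function names below are made up):

```python
# CPython strips `assert` statements when the interpreter runs with the -O flag
# (e.g. `python -O script.py`), so the check silently disappears in optimized
# runs, while an explicit `raise ValueError` is always executed.


def check_lengths_with_assert(images, mapping):
    # Skipped entirely under `python -O`.
    assert len(images) == len(mapping), "length mismatch"


def check_lengths_with_raise(images, mapping):
    # Enforced regardless of interpreter flags.
    if len(images) != len(mapping):
        raise ValueError("length mismatch")
```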
Yes, exactly!
LGTM, thanks for improving!
Are you sure you have the exact same version of black as is pinned in our setup? The CI check for style is passing on master, so none of the reformatting unrelated to the changes in your PR is necessary.
The easiest might be to revert your last commit once you have made sure of the black version, as black doesn't undo the lines it adds.
```python
if len(images_with_overflow) != len(overflow_to_sample_mapping):
    raise ValueError(
        f"Expected length of images to be the same as the length of overflow_to_sample_mapping, but got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
    )
```
Can you split that message on several lines to respect the 119 char limit?
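For illustration, one way to stay under the limit is implicit string concatenation (a sketch only; the helper name is hypothetical and the final wording in the PR may differ):

```python
def _check_overflow_lengths(images_with_overflow, overflow_to_sample_mapping):
    # Adjacent string literals are concatenated, keeping each source line short.
    if len(images_with_overflow) != len(overflow_to_sample_mapping):
        raise ValueError(
            "Expected length of images to be the same as the length of overflow_to_sample_mapping, but"
            f" got {len(images_with_overflow)} and {len(overflow_to_sample_mapping)}"
        )
```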
Done. Is there a reason the CI can't currently catch cases where strings exceed the 119 char limit? (I know black doesn't enforce a line-length limit on strings by default.)

Fixed the formatting issue too (it turns out my VSCode uses a different version of black to autoformat on save, which running `make style` with the HF-pinned black version afterwards can't undo).
…g unrelated formatting changes
Thanks again for your contribution!
… in case of overflowing tokens (huggingface#17092)

* add get_overflowing_images function to ensure 1-to-1 mapping between samples and images in LayoutLMv2Processor
* make style
* add test for overflowing_tokens, change assert to ValueError, avoiding unrelated formatting changes
* change line length by passing --preview into black
Thanks for handling this @ghlai9665
What does this PR do?
Fixes #13554
Problem re-summarized: when `return_offsets_mapping` is set to True, `LayoutLMv2Processor` would break up sequences that are too long into multiple `input_ids` sequences, causing a mismatch between `input_ids` (longer in the case of overflowing tokens) and `images`. This fix ensures a 1-to-1 mapping between `images` and `input_ids`.

Reproducible Example: (the assertion at the end fails without the fix and passes with it)
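Since the original snippet is not reproduced here, below is a hedged sketch of what such a reproduction could look like; the dummy image, word list, and exact kwargs are assumptions rather than the original example, and `return_overflowing_tokens=True` with truncation is what produces the extra sequences:

```python
from PIL import Image

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Processor, LayoutLMv2TokenizerFast

# Disable OCR so we can pass our own words/boxes (avoids a pytesseract dependency).
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

image = Image.new("RGB", (224, 224), "white")  # dummy document image
words = ["word"] * 1000                        # long enough to overflow max_length
boxes = [[0, 0, 10, 10]] * 1000                # one dummy box per word

encoding = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

# Without the fix there is a single image but several input_ids sequences;
# with the fix every overflowing sequence gets its own copy of the image.
assert len(encoding["image"]) == len(encoding["input_ids"])
```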
Required Input from Reviewers

Right now, the `LayoutLMv2Processor` returns a list for `encoded_inputs["image"]`, regardless of the value of `return_tensors`. If we want it to return a torch tensor when `return_tensors == "pt"`, we have to `torch.stack` the list (and do something similar to support "np" and "tf"). Should I implement this in `get_overflowing_images`? Or should I just leave the return type as a list and print a warning? (A rough sketch of the stacking idea follows after the reviewer list below.)

Who can review?
@NielsRogge @sgugger @LysandreJik
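Regarding the `return_tensors` question above, a rough sketch of the `torch.stack` idea (the helper name and exact conversion are hypothetical, not part of the PR):

```python
import numpy as np
import torch


def stack_images(images_with_overflow, return_tensors=None):
    # Keep the plain list by default; stack into a single tensor when a
    # PyTorch return type is requested. "np" and "tf" could be handled analogously.
    if return_tensors == "pt":
        return torch.stack([torch.as_tensor(np.array(image)) for image in images_with_overflow])
    return images_with_overflow
```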
P.S. The `test_processor_case_1` in `test_processor_layoutlmv2.py` fails before this PR. I'd be happy to look at it as well, but it's unrelated to this PR.