Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve LayoutLM #9476

Merged
merged 7 commits into from
Jan 12, 2021
Merged

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Jan 8, 2021

What does this PR do?

  • Improve documentation of LayoutLM, explaining how people can normalize bounding boxes before passing them to the model, add links to the various datasets on which the model achieves state-of-the-art results, add code examples in the documentation for the various models
  • Add notebook to the list of community notebooks showcasing how to fine-tune LayoutLMForTokenClassification on the FUNSD dataset (on which the model achieves SOTA results)
  • Add integration tests, which confirm that the model outputs the same tensors as the original implementation on the same input data
  • Add LayoutLMForSequenceClassification, which makes it possible to fine-tune LayoutLM for document image classification tasks (such as the RVL-CLIP dataset), extra tests included.

Fixes the following issues:

Who can review?

@LysandreJik, @patrickvonplaten, @sgugger

def test_LayoutLM_backward_pass_reduces_loss(self):
"""Test loss/gradients same as reference implementation, for example."""
pass
self.assertTrue(torch.allclose(outputs.loss, expected_loss, atol=0.1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atol=1e-3 would not pass here?

Copy link
Contributor

@patrickvonplaten patrickvonplaten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR! Thanks so much for taking care of this. The notebook looks great as well. Left a couple of comments. If possible it would be awesome if we could make the example a bit more concise (e.g. to just use tokenizer(...) instead of tokenize(...) and conevrt_tokens_to_ids(...).

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this!

docs/source/model_doc/layoutlm.rst Outdated Show resolved Hide resolved
src/transformers/models/layoutlm/modeling_layoutlm.py Outdated Show resolved Hide resolved
src/transformers/models/layoutlm/modeling_layoutlm.py Outdated Show resolved Hide resolved
src/transformers/models/layoutlm/modeling_layoutlm.py Outdated Show resolved Hide resolved
tests/test_modeling_layoutlm.py Outdated Show resolved Hide resolved
tests/test_modeling_layoutlm.py Outdated Show resolved Hide resolved
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job @NielsRogge! Thanks a lot for your contribution!

src/transformers/models/layoutlm/modeling_layoutlm.py Outdated Show resolved Hide resolved
src/transformers/models/layoutlm/modeling_layoutlm.py Outdated Show resolved Hide resolved
tests/test_modeling_layoutlm.py Show resolved Hide resolved
@NielsRogge NielsRogge force-pushed the Improving_layoutlm_v2 branch from 7eed265 to fffc19a Compare January 8, 2021 17:10
@NielsRogge
Copy link
Contributor Author

NielsRogge commented Jan 8, 2021

Thanks for the reviews, I've addressed all comments. There are 2 things remaining:

  • in the code examples, I use both tokenize() and convert_tokens_to_ids as the bounding boxes (which are at word-level) need to be converted to token-level. Is there a better solution?
words = ["Hello", "world"]
normalized_word_boxes = [637, 773, 693, 782], [698, 773, 733, 782]
tokens = []
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    token_boxes.extend([box] * len(word_tokens))
  • according to @sgugger the input data on which the integration tests are ran are maybe too long, and black formatting causes them to be flattened vertically. Could you maybe fix this @LysandreJik?

@LysandreJik
Copy link
Member

I pushed the reformat you asked for @NielsRogge, make sure to pull before doing any more changes!

@NielsRogge
Copy link
Contributor Author

Ok thank you, so the only thing remaining is make the code examples more efficient? Is there a way to make the code block (see comment above) better?

@LysandreJik LysandreJik merged commit e45eba3 into huggingface:master Jan 12, 2021
guyrosin pushed a commit to guyrosin/transformers that referenced this pull request Jan 15, 2021
* Add LayoutLMForSequenceClassification and integration tests

Improve docs

Add LayoutLM notebook to list of community notebooks

* Make style & quality

* Address comments by @sgugger, @patrickvonplaten and @LysandreJik

* Fix rebase with master

* Reformat in one line

* Improve code examples as requested by @patrickvonplaten

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants