Add document token classification pipeline (#1) #21012

vaishak2future · 2023-01-04T21:31:22Z

What does this PR do?

Adds a pipeline for Document Token Classification. The code is mostly based on the PR for Document Question Answering, #18414.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@Narsil

HuggingFaceDocBuilderDev · 2023-01-04T21:48:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Narsil · 2023-01-05T11:08:52Z

Hi @vaishak2future

Did you know that layoutlm already implements object-detection: https://huggingface.co/Narsil/layoutlmv3-finetuned-funsd

This might be close enough to what you need, no?

vaishak2future · 2023-01-06T17:56:14Z

@Narsil, thank you for looking at the PR. While Object Detection does solve this particular instance of the problem, we see Document Token Classification as a multimodal task, separate from the unimodal task of Object Detection. Document Token Classification requires two modalities: an image and a set of tokens.

This lets users pick their OCR of choice (especially for languages that are not well handled by Tesseract), and also supply their own tokens, which need not be text that appears in the image itself.
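To make the "image plus tokens" input concrete, here is a minimal sketch of how user-supplied OCR output could be prepared for a LayoutLM-style model. It assumes the LayoutLM-family convention of scaling bounding boxes to a 0-1000 range relative to the page size; the function names (`normalize_box`, `build_inputs`) and the returned dictionary layout are hypothetical and are not taken from this PR.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def normalize_box(box: Sequence[float], width: int, height: int) -> Box:
    """Scale a pixel-space box to the 0-1000 range (assumed LayoutLM convention)."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )


def build_inputs(
    words: Sequence[str],
    pixel_boxes: Sequence[Sequence[float]],
    image_size: Tuple[int, int],
) -> Dict[str, list]:
    """Pair each word with its normalized box (hypothetical input layout)."""
    if len(words) != len(pixel_boxes):
        raise ValueError("each word needs exactly one bounding box")
    width, height = image_size
    return {
        "words": list(words),
        "boxes": [normalize_box(b, width, height) for b in pixel_boxes],
    }


# Words and boxes could come from any OCR engine, or be written by hand.
example = build_inputs(
    ["Invoice", "Total:"],
    [(50, 20, 250, 60), (50, 400, 150, 440)],
    image_size=(500, 800),
)
```

The point of the sketch is only that the words and boxes are ordinary Python data the caller controls, so nothing ties them to Tesseract or to text that is actually visible in the image.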

vaishak2future · 2023-01-09T22:05:05Z

@Narsil All checks are now passing. Could you please review? Thanks.

Narsil · 2023-01-16T12:36:24Z

Hi @vaishak2future,

I understand the idea of removing Tesseract where needed. For the extra tokens, were you imagining extracting tokens from the PDF directly, maybe? (This was also an idea behind document-question-answering, where the thinking is that we could always fuse the pipeline later with regular visual-question-answering.)

Here there are a few things that make me hesitant:

- Pipelines are made to be usable by non-ML programmers. Here it's kind of tricky, since tokens, boxes and the like are quite ML-involved.
- Pipelines are made to be relatively generic over different model types. Here only layoutlm would work as-is. The idea is to keep the number of pipelines relatively small, so that they stay discoverable by users.

That being said, power-user needs like yours should be supported, IMO. I would have to look at how to implement this within object-detection, but I don't see any issue with adding extra parameters for such niche but extremely useful use cases.
For instance, the asr pipeline lets users send the raw audio frames directly, which IMO is essentially the same idea (bypassing or very specifically modifying some preprocessing, which would be the OCR in your case).
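The "extra parameters that bypass preprocessing" idea can be sketched in plain Python. This is not the pipeline code from this PR or from Transformers; `resolve_words_and_boxes`, its parameters, and `fake_ocr` are hypothetical names used only to illustrate the control flow: run OCR by default, and skip it when the caller already supplies words and boxes.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]
OcrFn = Callable[[object], Tuple[List[str], List[Box]]]


def resolve_words_and_boxes(
    image: object,
    words: Optional[Sequence[str]] = None,
    boxes: Optional[Sequence[Box]] = None,
    ocr_fn: Optional[OcrFn] = None,
) -> Tuple[List[str], List[Box]]:
    """Return (words, boxes), running OCR only when the caller gave neither."""
    if words is not None and boxes is not None:
        if len(words) != len(boxes):
            raise ValueError("words and boxes must have the same length")
        # Power-user path: the OCR step is bypassed entirely.
        return list(words), list(boxes)
    if words is not None or boxes is not None:
        raise ValueError("words and boxes must be passed together")
    if ocr_fn is None:
        raise ValueError("no words/boxes given and no OCR function configured")
    # Default path: non-ML users only pass an image.
    return ocr_fn(image)


def fake_ocr(image: object) -> Tuple[List[str], List[Box]]:
    """Stand-in for a real OCR engine, used only for this example."""
    return ["hello"], [(0, 0, 10, 10)]


from_ocr = resolve_words_and_boxes("page.png", ocr_fn=fake_ocr)
from_user = resolve_words_and_boxes(
    "page.png", words=["Total"], boxes=[(1, 2, 3, 4)], ocr_fn=fake_ocr
)
```

With this shape the default call stays simple for users who only have an image, while the optional arguments cover the custom-OCR and custom-token cases.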

What do you think?

Pinging @sgugger @LysandreJik for other opinions on this.

Regardless, I briefly looked at the PR and the code seems good. There are a few nits regarding how the tests are structured and how many different inputs are accepted, but overall it looks quite good. I'll hold my comments until we reach a decision on this, as there are no big structural blockers on my end, IMO.

sgugger · 2023-01-16T14:10:30Z

This looks very specific to one model. We can't host all possible pipelines in Transformers, so in such a case we should rely on the code-on-the-Hub feature for pipelines. You can see pointers here.

github-actions · 2023-02-10T15:02:02Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Add document token classification pipeline (#1)

e9d8717

style changes

3b3db18

vaishak2future added 4 commits January 6, 2023 10:05

Update document_token_classification.py

a77e4c5

Update test_pipelines_document_token_classification.py

5dffffe

Update document_token_classification.py

4fe4ec1

Update update_metadata.py

682d1bc

github-actions bot closed this Feb 19, 2023
