Add document token classification pipeline (#1) #21012

Closed

Conversation

vaishak2future
@vaishak2future vaishak2future commented Jan 4, 2023

What does this PR do?

Adds a pipeline for Document Token Classification. The code is mostly based on the PR for Document Question Answering, #18414.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Narsil
Contributor

Narsil commented Jan 5, 2023

Hi @vaishak2future

Did you know that layoutlm already implements object-detection: https://huggingface.co/Narsil/layoutlmv3-finetuned-funsd

This might be close enough to what this PR does, no?

@vaishak2future
Author

@Narsil, thank you for looking at the PR. While Object Detection does solve this particular instance of the problem, we see Document Token Classification as a multimodal task, separate from the unimodal task of Object Detection. Document Token Classification requires two modalities: an image and a set of tokens.

This lets users pick their OCR of choice (especially for languages that are not well handled by Tesseract) and also supply their own tokens, which might not be text on the image itself.
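As a rough illustration of the two-modality input described above, the sketch below pairs user-supplied words with bounding boxes and rescales the boxes to a 0-1000 grid, which is the convention LayoutLM-style models expect. The names `normalize_box`, `DocumentTokens` and `build_inputs` are hypothetical helpers for this example, not part of the PR or of Transformers, and the OCR output format (a list of `(word, pixel_box)` pairs) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def normalize_box(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Rescale a pixel-space box to a 0..scale grid (LayoutLM-style models use 1000)."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / width),
        int(scale * y0 / height),
        int(scale * x1 / width),
        int(scale * y1 / height),
    )


@dataclass
class DocumentTokens:
    """Hypothetical container: the token modality that accompanies the image."""

    words: List[str]
    boxes: List[Box]

    def __post_init__(self) -> None:
        if len(self.words) != len(self.boxes):
            raise ValueError("each word needs exactly one bounding box")


def build_inputs(ocr_words: Sequence[Tuple[str, Box]], image_size: Tuple[int, int]) -> DocumentTokens:
    """Turn (word, pixel_box) pairs from any OCR engine into normalized tokens."""
    width, height = image_size
    words = [word for word, _ in ocr_words]
    boxes = [normalize_box(box, width, height) for _, box in ocr_words]
    return DocumentTokens(words, boxes)
```

Because the words and boxes are plain Python values, they can come from any OCR engine, from a PDF text layer, or be written by hand, which is the flexibility the comment argues for.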

@vaishak2future
Author

@Narsil All checks are now passing. Could you please review? Thanks.

@Narsil
Contributor

Narsil commented Jan 16, 2023

Hi @vaishak2future,

I understand the idea of removing Tesseract where needed. For the extra tokens, were you imagining extracting tokens directly from the PDF, maybe? (This was also an idea behind document-question-answering, where we could always fuse the pipeline later with regular visual-question-answering.)

Here there are a few things that make me hesitant:

  • Pipelines are made to be usable by non-ML programmers. Here it's kind of tricky, since tokens, boxes and such are quite ML-involved.
  • Pipelines are made to be relatively generic over different model types; here only layoutlm would work as-is. The idea is to keep the number of pipelines relatively small, so that they stay discoverable by users.

That being said, power-user use cases like yours should be supported IMO. I would have to look at how to implement this within object-detection, but I don't see any issue with adding extra parameters for such niche but extremely useful use cases.
For instance, the asr pipeline lets users send the raw audio frames directly, which IMO is essentially the same idea (bypassing or very specifically modifying some preprocessing, which would be the OCR in your case).

What do you think?

Pinging @sgugger @LysandreJik for other opinions on this.

Regardless, I briefly looked at the PR and the code seems good. There are a few nits regarding how tests are structured and how many different inputs are accepted, but overall it looks quite good. I'll hold my comments until we reach a decision on this, as there are no big structural blockers on my end IMO.
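To make the "extra parameters" idea from the comment above concrete, here is a minimal sketch of a preprocessing step that runs OCR by default but lets power users bypass it by passing their own words and boxes. The function `prepare_document`, its parameters, and the `ocr` callable are all hypothetical names for illustration; this is not the actual signature of any Transformers pipeline.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]
OcrResult = Tuple[List[str], List[Box]]


def prepare_document(
    image: Any,
    words: Optional[Sequence[str]] = None,
    boxes: Optional[Sequence[Box]] = None,
    ocr: Optional[Callable[[Any], OcrResult]] = None,
) -> OcrResult:
    """Return (words, boxes), running OCR only when the caller supplied neither."""
    # Words and boxes only make sense together.
    if (words is None) != (boxes is None):
        raise ValueError("pass both words and boxes, or neither")

    if words is not None and boxes is not None:
        # Power-user path: the OCR step is skipped entirely.
        if len(words) != len(boxes):
            raise ValueError("words and boxes must have the same length")
        return list(words), list(boxes)

    # Default path: fall back to the configured OCR engine.
    if ocr is None:
        raise ValueError("no words/boxes given and no OCR engine configured")
    return ocr(image)
```

The default call stays simple for non-ML users (image in, predictions out), while the optional arguments cover the niche case without requiring a separate pipeline.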

@sgugger
Collaborator

sgugger commented Jan 16, 2023

This looks very specific to one model. We can't host all possible pipelines in Transformers, so in such a case we should rely on the code-on-the-Hub feature for pipelines. You can see pointers here.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Feb 19, 2023