LayoutLM-based visual question answering model, weights, and pipeline #18380

Closed · ankrgyl opened this issue Jul 31, 2022 · 4 comments · Fixed by #18407 or #18414

@ankrgyl (Contributor) commented Jul 31, 2022

Feature request

Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.

The primary goal of this feature request is to extend either the question answering or visual question answering pipeline to be as easy to use as, for example, the distilbert-base-cased-distilled-squad model. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows how to fine-tune the model for this use case. It would be very powerful for a number of use cases if using LayoutLM for document question answering were as easy as using BERT-like models for text question answering.
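
For context, here is roughly what that flow looks like for plain text today (a minimal sketch; the question and context strings are made up):

```python
from transformers import pipeline

# Extractive QA on plain text is a one-liner with the existing pipeline.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is the invoice total?",
    context="Invoice #123. The invoice total is $42.50, due August 31.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': '$42.50'}
```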

This will require a few additions, all of which I have working code for and would be happy to contribute:

  1. Extend the QuestionAnsweringPipeline or the VisualQuestionAnsweringPipeline to support document inputs. I think the latter is the right pipeline, since it already takes an image as input, but ideally it could also accept a list of words + bounding boxes (in case users want to run their own OCR); see the sketch after this list.
  2. Hook up LayoutLMv2ForQuestionAnswering and LayoutLMv3ForQuestionAnswering to the pipeline. Ideally, there would also be a LayoutLMForQuestionAnswering class, since v2 and v3 are not licensed for commercial use.
  3. Publish pre-trained model weights with an easy-to-follow model card. I found a few examples of fine-tuned LayoutLM QA models (e.g. this), but could not get them to run easily. For example, the "hosted inference API" UI throws an error when you try to run them. I think the visual question answering UI (which lets you load an image) might be a better fit, but I am very open to discussion on what the best experience would be.
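
To make the proposal concrete, here is a hypothetical sketch of what the end state could look like. The checkpoint name and the word_boxes argument are illustrative, not an existing API:

```python
from transformers import pipeline

# Hypothetical: a LayoutLM checkpoint fine-tuned for document QA.
doc_qa = pipeline("visual-question-answering", model="some-org/layoutlm-document-qa")

# Image-only input: the pipeline would run OCR internally.
doc_qa(image="invoice.png", question="What is the invoice total?")

# Bring-your-own-OCR input: pre-computed words and bounding boxes.
words = ["Invoice", "total:", "$42.50"]
boxes = [[57, 11, 98, 20], [100, 11, 132, 20], [134, 11, 170, 20]]
doc_qa(
    image="invoice.png",
    question="What is the invoice total?",
    word_boxes=list(zip(words, boxes)),  # illustrative argument name
)
```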

Motivation

When we started using transformers, we saw the question-answering pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder: could we make it that easy for document QA too?

Your contribution

We have working code for all of the proposed feature requests, which we'd be happy to contribute. We also have a pre-trained model that we're happy to upload along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (whatever works best within your processes).

@ankrgyl changed the title from "Adding LayoutLM-based visual question answering model, weights, and pipeline" to "LayoutLM-based visual question answering model, weights, and pipeline" on Jul 31, 2022

@LysandreJik (Member) commented

cc @Narsil as well as @NielsRogge

@Narsil (Contributor) commented Aug 1, 2022

Thank you for this proposal!

It is really well thought out, and everything you mention is pertinent. Adding support would be really awesome!

  • We probably need to use VisualQuestionAnswering for this one. What defines a pipeline is its set of inputs/outputs, and as far as I understand this fits: the input is an image + question text, and the output is a list of strings with scores attached, in decreasing order, up to top_k. Actually, for this one we might also be able to return the bbox, so that we could visually show where the information is in the original document. (Optional extra information is OK, but pipelines can't change the core input/output, so that users can easily switch between models/architectures.)
  • As far as I understand, the main reason we haven't already included the pipeline is the OCR. I think we can actually include it in the pipeline if it's easy to install (a single added dependency) and if we provide a clear error message when it's missing; see the sketch after this list. We already use ffmpeg for the audio pipelines, and kenlm when there's an n-gram layer on the model. Those are all pipeline-specific, so not necessary for transformers itself, but they do make users' lives easier.
  • For differentiating between LayoutLM and other models, we tend not to focus on actual model names (like LayoutLM) but rather on the model's ForXX name (ForDocumentQuestionAnswering maybe, @NielsRogge?), as those should have a consistent API. So when a new model comes around and implements the same API, there's no additional work for the pipeline (99% of the time, at least).
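
A minimal sketch of the optional-dependency pattern described above, assuming pytesseract as the OCR engine (the function name and error message are illustrative):

```python
def require_ocr():
    """Raise a clear, actionable error when the optional OCR dependency is absent."""
    try:
        import pytesseract  # noqa: F401  # assumed OCR backend, illustrative
    except ImportError:
        raise ImportError(
            "This pipeline needs an OCR engine to process raw images. "
            "Install it with `pip install pytesseract` (plus the tesseract "
            "binary), or pass pre-computed words and bounding boxes instead."
        )
```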

Feel free to start the PRs and ping me as early on as you want (so I can help with the details).

Here is the doc on adding new pipelines; most of it is not necessary since VQA already exists, but it should help with the overall design:
https://huggingface.co/docs/transformers/v4.21.0/en/add_new_pipeline#adding-it-to-the-list-of-supported-tasks2
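
For reference, the registration hook from that doc looks roughly like the sketch below; the task name and the stub pipeline class are placeholders, not the actual proposal:

```python
from transformers import AutoModelForQuestionAnswering, Pipeline
from transformers.pipelines import PIPELINE_REGISTRY

class MyDocumentQAPipeline(Pipeline):
    """Placeholder; a real pipeline would implement these four hooks."""

    def _sanitize_parameters(self, **kwargs):
        return {}, {}, {}

    def preprocess(self, inputs):
        return inputs

    def _forward(self, model_inputs):
        return model_inputs

    def postprocess(self, model_outputs):
        return model_outputs

# Register the new task so `pipeline("my-new-task", ...)` can resolve it.
PIPELINE_REGISTRY.register_pipeline(
    "my-new-task",
    pipeline_class=MyDocumentQAPipeline,
    pt_model=AutoModelForQuestionAnswering,
)
```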

Cheers, and thanks for the proposal!

@ankrgyl (Contributor, Author) commented Aug 1, 2022

@Narsil that's great to hear! I will start sending pieces as PRs and tag you for feedback.

@NielsRogge (Contributor) commented

Re-opening this as we're still working on the pipeline.
