LayoutLM-based visual question answering model, weights, and pipeline #18380

Closed · ankrgyl opened this issue Jul 31, 2022 · 4 comments · Fixed by #18407 or #18414

@ankrgyl (Contributor) commented Jul 31, 2022

Feature request

Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.

The primary goal of this feature request is to extend either the question answering or visual question answering pipeline to be as easy to use as, for example, the distilbert-base-cased-distilled-squad model. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows how to fine-tune the model for this use case. It would be very powerful for a number of use cases if using LayoutLM for document question answering were as easy as using BERT-like models for text question answering.
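
For context, here is roughly what that flow looks like for plain text today (a minimal sketch; the question and context strings are made up):

```python
from transformers import pipeline

# Extractive QA on plain text is a one-liner with the existing pipeline.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What is the invoice total?",
    context="Invoice #123. The invoice total is $42.50, due August 31.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': '$42.50'}
```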

This will require a few additions, all of which I have working code for and would be happy to contribute:

  1. Extend the QuestionAnsweringPipeline or the VisualQuestionAnsweringPipeline to support document inputs. I think the latter is the right pipeline, since it already takes an image as input, but ideally it could also accept a list of words + bounding boxes (in case users want to run their own OCR); see the sketch after this list.
  2. Hook up LayoutLMv2ForQuestionAnswering and LayoutLMv3ForQuestionAnswering to the pipeline. Ideally, there would also be a LayoutLMForQuestionAnswering class, since v2 and v3 are not licensed for commercial use.
  3. Publish pre-trained model weights with an easy-to-follow model card. I found a few examples of fine-tuned LayoutLM QA models (e.g. this), but could not get them to run easily. For example, the "hosted inference API" UI throws an error when you try to run them. I think the visual question answering UI (which lets you load an image) might be a better fit, but I am very open to discussion on what the best experience would be.
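
To make the proposal concrete, here is a hypothetical sketch of what the end state could look like. The checkpoint name and the word_boxes argument are illustrative, not an existing API:

```python
from transformers import pipeline

# Hypothetical: a LayoutLM checkpoint fine-tuned for document QA.
doc_qa = pipeline("visual-question-answering", model="some-org/layoutlm-document-qa")

# Image-only input: the pipeline would run OCR internally.
doc_qa(image="invoice.png", question="What is the invoice total?")

# Bring-your-own-OCR input: pre-computed words and bounding boxes.
words = ["Invoice", "total:", "$42.50"]
boxes = [[57, 11, 98, 20], [100, 11, 132, 20], [134, 11, 170, 20]]
doc_qa(
    image="invoice.png",
    question="What is the invoice total?",
    word_boxes=list(zip(words, boxes)),  # illustrative argument name
)
```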

Motivation

When we started using transformers, we saw the question-answering pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder: could we make it that easy for document QA too?

Your contribution

We have working code for all of the proposed feature requests, which we'd be happy to contribute. We also have a pre-trained model that we're happy to upload along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (whatever works best within your processes).

@ankrgyl changed the title from "Adding LayoutLM-based visual question answering model, weights, and pipeline" to "LayoutLM-based visual question answering model, weights, and pipeline" on Jul 31, 2022

@LysandreJik (Member) commented

cc @Narsil as well as @NielsRogge

@Narsil (Contributor) commented Aug 1, 2022

Thank you for this proposal!

It is really well thought out, and everything you mention is pertinent. Adding support would be really awesome!

  • We probably need to use VisualQuestionAnswering for this one. What defines a pipeline is its set of inputs/outputs, and as far as I understand this fits: the input is an image + question text, and the output is a list of strings with scores attached, in decreasing order, up to top_k. Actually, for this one we might also be able to return the bbox, so that we could visually show where the information is in the original document. (Optional extra information is OK, but pipelines can't change the core input/output, so that users can easily switch between models/architectures.)
  • As far as I understand, the main reason we haven't already included the pipeline is the OCR. I think we can actually include it in the pipeline if it's easy to install (a single added dependency) and if we provide a clear error message when it's missing; see the sketch after this list. We already use ffmpeg for the audio pipelines, and kenlm when there's an n-gram layer on the model. Those are all pipeline-specific, so not necessary for transformers itself, but they do make users' lives easier.
  • For differentiating between LayoutLM and other models, we tend not to focus on actual model names (like LayoutLM) but rather on the model's ForXX name (ForDocumentQuestionAnswering maybe, @NielsRogge?), as those should have a consistent API. So when a new model comes around and implements the same API, there's no additional work for the pipeline (99% of the time, at least).
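
A minimal sketch of the optional-dependency pattern described above, assuming pytesseract as the OCR engine (the function name and error message are illustrative):

```python
def require_ocr():
    """Raise a clear, actionable error when the optional OCR dependency is absent."""
    try:
        import pytesseract  # noqa: F401  # assumed OCR backend, illustrative
    except ImportError:
        raise ImportError(
            "This pipeline needs an OCR engine to process raw images. "
            "Install it with `pip install pytesseract` (plus the tesseract "
            "binary), or pass pre-computed words and bounding boxes instead."
        )
```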

Feel free to start the PRs and ping me as early on as you want (so I can help with the details).

Here is the doc on adding new pipelines; most of it is not necessary since VQA already exists, but it should help with the overall design:
https://huggingface.co/docs/transformers/v4.21.0/en/add_new_pipeline#adding-it-to-the-list-of-supported-tasks2
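
For reference, the registration hook from that doc looks roughly like the sketch below; the task name and the stub pipeline class are placeholders, not the actual proposal:

```python
from transformers import AutoModelForQuestionAnswering, Pipeline
from transformers.pipelines import PIPELINE_REGISTRY

class MyDocumentQAPipeline(Pipeline):
    """Placeholder; a real pipeline would implement these four hooks."""

    def _sanitize_parameters(self, **kwargs):
        return {}, {}, {}

    def preprocess(self, inputs):
        return inputs

    def _forward(self, model_inputs):
        return model_inputs

    def postprocess(self, model_outputs):
        return model_outputs

# Register the new task so `pipeline("my-new-task", ...)` can resolve it.
PIPELINE_REGISTRY.register_pipeline(
    "my-new-task",
    pipeline_class=MyDocumentQAPipeline,
    pt_model=AutoModelForQuestionAnswering,
)
```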

Cheers, and thanks for the proposal!

@ankrgyl (Contributor, Author) commented Aug 1, 2022

@Narsil that's great to hear! I will start sending pieces as PRs and tag you for feedback.

@NielsRogge (Contributor) commented

Re-opening this as we're still working on the pipeline.
