LayoutLM-based visual question answering model, weights, and pipeline #18380
Comments
cc @Narsil as well as @NielsRogge
Thank you for this proposal! It is really well thought out and everything you mention is pertinent.

Feel free to start the PRs and ping me as early on as you want (so I can help with the details). Here is the doc on adding new pipelines, though most of it is not necessary here. Cheers, and thanks for the proposal!
@Narsil that's great to hear! I will start sending pieces as PRs and tag you for feedback.
Re-opening this as we're still working on the pipeline.
Feature request
Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.
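The start/end post-processing mentioned above can be sketched in a few lines. This is an illustrative, simplified reimplementation of the idea (not the pipeline's actual code): given per-token start and end logits, pick the highest-scoring valid span.

```python
# Illustrative sketch of start/end candidate post-processing, simplified from
# what the question-answering pipeline handles internally (not the real code).
# Given start/end logits per token, pick the best span with start <= end and
# a bounded answer length.

def best_span(start_logits, end_logits, max_answer_len=15):
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e, e_logit in enumerate(end_logits):
            if s <= e < s + max_answer_len:
                score = s_logit + e_logit
                if score > best_score:
                    best_score = score
                    best = (s, e)
    return best, best_score

span, score = best_span([0.1, 2.0, 0.3], [0.2, 0.1, 3.0])
# span == (1, 2): answer starts at token 1 and ends at token 2
```

The pipeline also handles details this sketch skips, such as masking out question tokens and mapping token indices back to character offsets.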
The primary goal of this feature request is to extend either the question answering or visual question answering pipeline to be as easy to use as, for example, the distilbert-base-cased-distilled-squad model. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows you how to fine-tune the model for this use case. I think it'd be very powerful for a number of use cases if it were as easy to use LayoutLM for document question answering as it is to use BERT-like models for text question answering.
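For reference, this is what the existing text-only flow looks like with the distilbert-base-cased-distilled-squad checkpoint mentioned above (the question and context here are made-up examples):

```python
# How easy extractive QA already is for plain text with the
# question-answering pipeline and a real SQuAD-distilled checkpoint.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Who wrote Hamlet?",
    context="Hamlet is a tragedy written by William Shakespeare around 1600.",
)
print(result)  # dict with 'answer', 'score', 'start', and 'end' keys
```

The goal of this request is for document QA to be a comparable one-liner, with an image (or words + boxes) in place of the context string.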
This will require a few additions, all of which I have working code for that I'd be happy to contribute:
1. Extending either the `QuestionAnsweringPipeline` or `VisualQuestionAnsweringPipeline` pipeline to support document inputs. I think the latter would be the right pipeline, since it already takes an image as input, but ideally it could also take a list of words + bounding boxes as input (in case users want to run their own OCR).
2. Adding `LayoutLMv2ForQuestionAnswering` and `LayoutLMv3ForQuestionAnswering` to the pipeline. Ideally, there would also be `LayoutLMForQuestionAnswering`, since v2 and v3 are not licensed for commercial use.

Motivation
When we started using transformers, we saw the `question-answering` pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder -- could we make it that easy for document QA too?

Your contribution
We have working code for all of the proposed feature requests that we'd be happy to contribute. We also have a pre-trained model that we're happy to upload along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (however works best within your processes).
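To illustrate the words + bounding boxes input path proposed above: LayoutLM-family models expect each word's box normalized to a 0-1000 coordinate grid relative to the page size. A minimal helper for users running their own OCR might look like this (the function name is ours, not part of transformers):

```python
# LayoutLM-family models expect word bounding boxes on a 0-1000 grid
# normalized by page size. This helper name is illustrative, not an API.

def normalize_box(box, page_width, page_height):
    """box is (x0, y0, x1, y1) in pixels; returns ints on a 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

print(normalize_box((100, 50, 300, 80), page_width=1000, page_height=800))
# (100, 62, 300, 100)
```

A pipeline accepting `(word, box)` pairs directly would let users plug in any OCR engine after a normalization step like this one.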