Add DocumentQuestionAnswering pipeline #18414

ankrgyl · 2022-08-02T05:38:05Z

What does this PR do?

This PR extends VisualQuestionAnsweringPipeline to accept words and boxes as input, passes them into the tokenizer/model (along with the question), and post-processes their QuestionAnsweringModelOutput response.

Fixes #18380

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@Narsil

ankrgyl · 2022-08-02T05:40:31Z

@Narsil this is basically a skeleton implementation that I thought I'd send out sooner than later to start getting your input.

I've left a few questions throughout tagged with "TODO" in the comments. The big question is how much/whether to reuse the code in QuestionAnsweringPipeline, which has a lot of overlap (notably preparing the spans and post-processing the output). For example, I could refactor out methods like QuestionAnsweringPipeline.decode to share the implementation, inherit from QuestionAnsweringPipeline, etc.

HuggingFaceDocBuilderDev · 2022-08-02T05:51:58Z

The documentation is not available anymore as the PR was closed or merged.

Narsil

Great PR despite my many comments (a lot are regarding the same items).

Thanks for this work!

The biggest things to make this PR easier to go through:

Don't do padding/truncating and let the model crash when it's too big. (It will require a new PR to support context splitting IMO)
Don't try to support inference without image, that should be a different pipeline.
Flat dicts at _forward boundary.
Add tests (both fast and slow, I can definitely help set them up)

Please let me know if anything is unclear or you're not sure how to fix, I can definitely help go further make code edits, you've done most of the work here.

For the is_split_into_words flag, don't worry about it, I tried to raise the concern but there's no real great solution anyway it seems so this will be good enough.

Narsil · 2022-08-02T07:52:08Z

src/transformers/pipelines/visual_question_answering.py

+        if (isinstance(image, (Image.Image, str)) or image is None) and isinstance(question, str):
+            inputs = {"question": question}
+            if image is not None:
+                inputs["image"] = image


I think the image should be mandatory for it to make sense in this pipeline.

How can the model do anything without an image ?

Optional inputs are really not great, since it makes pipeline usage sort of guesswork.
Optional inputs are called parameters and none should be mandatory (so at least there's common grounds for users to use a model with some sane defaults)

Maybe these models can implement both qa and vqa pipelines then ? (But the qa also requires a context instead of an image).

I'll rework this per our discussion about the role of the image vs. words/boxes.

Narsil · 2022-08-02T07:57:31Z

src/transformers/pipelines/visual_question_answering.py

-        image = load_image(inputs["image"])
-        model_inputs = self.tokenizer(
-            inputs["question"], return_tensors=self.framework, padding=padding, truncation=truncation
+    def preprocess(self, inputs, padding=False, truncation=False, words=[], boxes=[]):


Never use lists in default arguments (=[]):
https://docs.python-guide.org/writing/gotchas/

Also words and boxes are NOT parameters, they should come from OCRing the image. (IIUC)

parameters are something that should always be possible to enable when instantiating the whole pipeline

pipe = pipeline(...., arg=XX) for result in pipe(dataset): print(result)

Here it doesn't seem like words or boxes make sense in this context. They should come from the data itself.
If it becomes an actual inputs -> Different pipeline (Question answering maybe, but the names are going to be different, and no bounding boxes can be sent either for the same reason, maybe we can implement without sending boxes ? What are even boxes when there is no image ?)
If it comes from the image, everything seems fine.

OMG how did I not know that! Thank you for pointing it out. I'll fix the default argument piece and rework these into optional arguments per our discussion below.

Narsil · 2022-08-02T10:21:15Z

src/transformers/pipelines/visual_question_answering.py

+        if "words" in inputs:
+            # TODO: Can we refactor and share (or inherit from) the QuestionAnweringPipeline? I think we want to do
+            # the same thing with the tokenizer, except we have a few extra arguments (e.g. the bounding boxes).
+            extra_tokenizer_params = {
+                "return_token_type_ids": True,
+                "return_attention_mask": True,
+                "is_split_into_words": True,
+                # "max_length": 512,
+            }
+            padding = "max_length"
+            truncation = True
+            question = [question]
+            text_pair = inputs["words"]


Actually you shouldn't really override padding/truncation by default.

padding/truncation should be user controlled never done automatically (it has some performance implications both quality wise and speedwise). Pipelines should never hide such details (It's done on some pipelines but that's purely legacy). (We can provide some simple parameters arguments to workaround those issues)

Why did you choose to add these ? There might be another way to do the same thing.

Narsil · 2022-08-02T10:30:47Z

src/transformers/pipelines/visual_question_answering.py

+            padding = "max_length"
+            truncation = True
+            question = [question]
+            text_pair = inputs["words"]


is_split_into_words is really dangerous and should not be used as much as possible. It bypasses the pretokenizer of the tokenizer and means the actual resulting tokens might be very different from what the model is supposed to see. It's actually only used in very corner cases when there is no real other way.

Let's say your OCR detects "New York" as a single box, and the tokenizer is wordlevel and knows only "New" and "york" and uses a space splitting pre_tokenizer, then using is_split_into_words means the model is going to receive unk (since it cannot correctly split the space) instead of ["new", "york"].

If I understand correctly, you're trying to work around the fact that you cannot send a list of strings (the boxes found in the image) directly to the tokenizer. (There's only text + text_pair. )

Here we can probably workaround that fact by using a single string (instead of a list) and dealing with offsets instead.
Another option would be to do the post_processor "manually" (directly using the underlying tokenizer).

Both options are not great, but they would really avoid the caveats that is implied with is_split_into_words which IMO is really a danger here. (Most tokenizers have no concept of words, and depending of the OCR actual chunking it can cause issues as shown, the first option's issue is that the offsets boundary might not be on the boxes, ).

Since this is more involved, we can keep that as-is for this PR, which a comment noting the caveat and eventually fix later.
Maybe tokenizers can make the fix even simpler by simply accepting lists of strings instead of just 1 or a pair.

Okay yeah I totally agree with you. The reason I pass this argument in is that the tokenizer complains if I try to pass in an (unsplit) question and a (split) text_pair of words, and for this use case, we don't need to index within the question to produce an answer (accuracy/performance issues aside).

I believe the LayoutLMv3Tokenizer gets around this limitation by tokenizing the question separately from the words in this case. I appreciate you letting us keep it as is for this PR and then we can work on a better solution (which I'm happy to help contribute).

Narsil · 2022-08-02T10:36:03Z

src/transformers/pipelines/visual_question_answering.py

+                for i, s, w in zip(
+                    encoding.input_ids[batch_index],
+                    encoding.sequence_ids(batch_index),
+                    encoding.word_ids(batch_index),


Here you use word ids but as basically the box_id and the sequence_id as a flag to know if the token is in the context or not. Right ?

If yes, that's just an confirmation that there's a problem with the current code (totally understandable, you're working around limitations) because the "words" are not really words, but chunks of text in an image.

Again no real change needed, just emphasizing that this behavior will rely on the OCR and tokenizer working in specific ways.

Yes. You could read this code colloquially as "if the box belongs to a word in the doc (i.e. sequence id is 1) then assign it a bounding box based on its original (word id) index. If the box belongs to a separator token, assign it [1000,1000,1000,1000]. Otherwise, assign it to [0,0,0,0]".

The challenge you're raising (words vs. boxes in the image) is a very real challenge for document processing tools. Most do not support "within-box" predictions. Definitely a problem worth visiting in the future.

Narsil · 2022-08-02T10:59:39Z

src/transformers/pipelines/visual_question_answering.py

+
+    # TODO: This is a very poor implementation of start/end (just here for completeness sake).
+    # Ideally we can refactor/borrow the implementation in the question answering pipeline.
+    results = []


I really like the idea of reusing qa implementation here (especially since it's really non-trivial).

The easiest thing to do should be to extract the qa's code into a function and reuse it here if it's exactly the same.

I'll do that!

Narsil · 2022-08-02T11:01:29Z

src/transformers/pipelines/visual_question_answering.py

-    def __call__(self, image: Union["Image.Image", str], question: str = None, **kwargs):
+    def __call__(
+        self,
+        image: [Optional[Union["Image.Image", str]]] = None,


This needs to be set at call time. It's a visual question answering so it does require an image.

Will adjust per our discussion below. I think what we should do is make image a required argument, and then (without making it too obvious), allow an advanced user to pass None (or another empty-like value) and handle it gracefully within the implementation.

Narsil · 2022-08-02T11:04:41Z

src/transformers/pipelines/visual_question_answering.py

+    def __call__(
+        self,
+        image: [Optional[Union["Image.Image", str]]] = None,
+        question: Optional[Union[List[str], str]] = None,


This None here exists solely to support pipeline({"image": image, "question": question}) (Which is necessary when you want to work with datasets or lists directly).

It's not documented kind of on purpose, the documentation bodies showcases the various uses, but here the strict-typing just causes things to be less readable IMO, since if you do send it it should be a str.

I read the documentation below and the List options for question and image are actually wrong and shouldn't be there.

Okay cool. I'll just revert this since we'll place things in a new pipeline anyway.

Feel free to do the typing in the new pipeline. The most critical piece is the docstring (and this is the one where readability > completeness)

Narsil · 2022-08-02T11:10:10Z

src/transformers/pipelines/visual_question_answering.py

+        if "logits" in model_outputs:
+            return postprocess_sequence_output(self.model, model_outputs, self.framework, top_k)
+        elif "start_logits" in model_outputs and "end_logits" in model_outputs:
+            return postprocess_qa_output(


It's OK to do it this way.

This is directly linked to the ModelForXX api to return logits or start_logits.

Okay cool, thanks for the clarification.

Narsil · 2022-08-02T11:10:25Z

src/transformers/pipelines/visual_question_answering.py

-        scores = scores.tolist()
-        ids = ids.tolist()
-        return [{"score": score, "answer": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
+            assert False, "Unknown output format"


Again please raise real exceptions.

Will fix. These were meant to be temporary as I was sketching out the code.

No worries.

ankrgyl · 2022-08-02T13:48:32Z

@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.

The reason (I think) It's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.

Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.

I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.

Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. For example, another option could be to add a new pipeline called DocumentQuestionAnswering (or similar) that handles inputs of this shape. Let me know your thoughts.

Narsil · 2022-08-02T14:30:48Z

@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.

Yes very much so !
I think it's great to have a clear conversation about this.

I will try and convey this part of the library's perspective, but having your view is great too since I probably know less about OCRs and overall document processing than you.

The reason (I think) It's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.

This is very interesting to know !
I was under the impression that using an OCR could be streamlined much more.

The fact that the OCR has much impact on the quality of the results doesn't surprise me (and the is_split_into_words might play a non negligible role here)

Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.

Yes I felt the same thing when reading your code and pondering whether it should actually belonged or not in QA instead of VQA. I think you are right, the image information is too valuable to be thrown away.

I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.

Very reasonable ! :)

Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. Let me know your thoughts.

Ok, I am going to recap the main goal of the pipeline:

A pipeline is a tool to make ML accessible to non ML practitioners.

That's the first goal, and in doing that, we don't want to hide any ML details that might hurt users unknowingly (like chunking things that can hurt output quality without being opt-in). So hide as many things as possible, when the defaults are correct, but don't hide any magic that could be use-case dependent. For instance, truncating without asking for explicit user consent (via a parameter) means the user will try and send large chunks of texts, and get an output that will correspond only to a tiny chunk of it, without him realizing it.

Another secondary goal is to make them as reusable/extendable as possible, but only when it doesn't contradict any of the previous goals.

With that in mind, you see why having inputs/outputs that depend on the actual model type, forces non ML practitioners to know about model types, where the goal is to try and lift that burden. If we can ensure that sending the same input, will receive the same output, it means users can jump very easily between models. So when AwesomeModelA comes out, you can just swap its name and make it work. Same goes for iterations/fine-tuning of the same model or different models and so one.

Here I can see I think two solutions:

1/ We create a new pipeline (DocumentQuestionAnsweringPipeline ?). The set of I/O is different so we should have different pipelines for these. For this pipeline it seems the input is boxes + words (which I would call texts personally as OCRs probably extract full string and don't necessarily reason about words). It's easy, but puts all the burden of the OCR on the user upfront.(If OCR choice is super tricky and we cannot realistically make that choice in a general fashion for users, it's probably the way to go).

2/ We keep using VisualQuestionAnswering but we enable a very easy way to use a custom OCR:

Most users will trigger an initial error that pytesseract (or something else) is not present and get suggested to install it to get an easy impression about results (mention all the caveats/link to some docs on how to choose the OCR for advanced users).
When those sane defaults are present, the pipelines will use those.
For experienced users that know about how OCR can impact deeply the results we can enable easy overriding like:

pipe = pipeline("mymodel-id", ocr=MyOCR())

class MyOCR:
    def forward(self, image):
       ...
       return texts, boxes

What do you think ? Which solution makes most sense from your perspective ?

Also regardless of choice here, we can extract whatever makes sense as an individual function within qa so you can reuse it, in a pipeline or anywhere else.

ankrgyl · 2022-08-02T14:37:52Z

For Option 1, to clarify, would you be open to allowing the user to pass in (optional) words and boxes? I think this is conceptually similar to your point about audio pipelines using ffmpeg but I may be misunderstanding something. Essentially, we'd run OCR on the image if the words/boxes are not passed in. And either way, depending on the model, pass the image into the model as well. If we made the words/boxes an optional input, then users could basically assert control where they'd like to, but the simple use cases will just work out of the box.

Personally, I think option 1 is the way to go. I can at least sketch out the code as a next step and we can take a look and reevaluate.

Narsil · 2022-08-02T15:00:26Z

would you be open to allowing the user to pass in (optional) words and boxes

I have the same uneasiness with any optional inputs. Either the pipeline needs the data or it doesn't. IMO the incoming data should be as strongly typed as possible, and definitely the computation should not depend on what the user actually sent (because then it becomes really hard to reason about what actually happened on a piece of data, which OCR was used ? Were the boxes correct ? etc...).

I feel like I am missing a piece of the puzzle here, so maybe we can do the other way around, let's try to devise what we would actually like to write as a user for this document processing.

IMO the simplest is something like:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Or maybe be a little more strict:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
# ValueError : This model is a document processing model, and requires an OCR to be able to use this pipeline,
# please pass an OCR. For demos, you can use `from transformers.pipelines import DefaultOCR`

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", ocr=DefaultOCR())

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}, ...]

ankrgyl · 2022-08-02T15:44:50Z

Ahh, okay, yes I agree that working from these examples is really helpful. Let me first be precise about what is required vs. not:

In all LayoutLM models, words and bounding boxes are technically required. The model itself requires them to be formatted a certain way (e.g. box coordinates are axis aligned and normalized between 0->1000), but it does not impose where they came from. The inspiration is something like "BERT + bounding boxes".
In LayoutLMv2 and v3, the models additionally accept an image (normalized to 224x224) as input. Theoretically, the model is able to use information from the image alongside the encoded words and boxes. Notably, in LayoutLMv1, you do not need to provide the image. And furthermore, you can fine tune v2 and v3 for many use cases without the additional image and achieve similar or in some cases better results.
The LayoutLMv2 and LayoutLMv3 processor classes in transformers optionally accept an apply_ocr argument. If set to True, while doing feature extraction from the image, they'll also use the tesseract library to run OCR and return them back out to caller, so you can pass them into the model. There is some tricky control flow throughout these classes that branches based on whether the user provides their own OCR or not.

I think part of why it's structured this way, or at least one of the advantages, is that in practice, since OCR can be costly (time and $$), many document processing practitioners will run OCR as a pre-processing step, so you can reuse its results across many invocations of extractions/questions/etc. E.g. imagine you were building an app that lets you point at a folder on your computer and then ask the files questions interactively. You'd probably implement this app by first running OCR on each file, and then re-using the OCR each time a user provides a new question as input.

I think with this in mind, there are probably a few different use cases that would be ideal to capture in the pipeline. I fully recognize that some of these may qualify as "more advanced" than the scope of a pipeline, so I'm open to and appreciate your push back on where that may be the case.

Scenario 1: Single file, single question

(your example above)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Scenario 2: Interactive REPL

(this is an approximation of a real-world use case)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

img = Image.load("id_card.jpg")
words, boxes = my_favorite_ocr(img)
while True:
  question = input("Ask a question of the image: ")
  print(pipe(image=img, question=question, words=words, boxes=boxes)

Scenario 3: Mixed Media Types

img = rasterize("my_tax_form.pdf")
words, boxes = text_extract("my_tax_form.pdf")

# NOTE: in certain models, e.g. LayoutLMv1, you do not even need to rasterize/pass in the image as input in this case
out = pipe(image=img, question="What is the person's income?", words=words, boxes=boxes)
# out == [{answer: "$10", score:0.8}, {"answer": "$1000", "score": 0.1}]

I can certainly imagine some alternatives:

Words/boxes could be required inputs, and we could simply enforce that the user run OCR (or alternative) before using the pipeline. I think in this case, the image should be considered optional input, simply because certain document processing models take it as input, and others don't.
Another would be to allow the user to provide a more advanced "OCR" input that could accept things like PDFs, spreadsheets, etc. and let it call out to OCR or use something else depending on the media type. I would say from experience, handling various document types is a can of worms and it prevents you from reusing pre-processed results across calls to the pipeline. (I believe this is your second suggestion).
My original suggestion: words/boxes could be optional, and when not provided, we use a default OCR implementation. One more advantage of this approach is that it's consistent with the LayoutLMv2 processor classes. So if a user starts with this pipeline, and then wants to dig one level deeper to the processor, they'll have a familiar pattern.

Let me know your thoughts. In certain options (namely the first), I think it'd be a requirement for it to be a DocumentQuestionAnsweringPipeline since the required inputs are different than the VisualQuestionAnsweringPipeline. In options 2 or 3, that might not be the case. I don't have a strong opinion about this but just wanted to clarify my understanding/thinking.

Narsil · 2022-08-02T18:05:09Z

Ok, thanks for all the explanation !

Now I think I am starting to understand it and all the use cases you displayed really make sense !

I think we can ignore layoutlmv1 not requiring the image so we can keep the number of cases rather small. (If you really know what you're doing you could always send an empty image, or we could just make the code in such a way that sending None doesn't break anything without actively trying to sanitize it)

Since the OCR is indeed quite costly (or can come from a non image !) I can really understand why we would need those optional boxes and texts. So let's support them. (We can make the docs extremely clear on that front)

I think example 1 should really be the focus for newcoming users, and we need to be able to support example 2 and example 3 to be usable in prod.

And if a user sends boxes + texts then we can simply skip the OCR part.

Actually, wdyt about having a list of tuples instead of two lists ? Two lists enables having different sized lists which would silently break things, I usually tend to prefer arguments that cannot by design be inconsistent, and lists of tuples cannot have different sizes and will necessarily raise errors when the tuple is unwrapped, so less room for error

I think all 3 examples could become tests so that we make sure that those cases are maintained through time.

I will ping @NielsRogge which is also really involved in vision and might have other insights.

ankrgyl · 2022-08-02T18:33:37Z

Awesome, I really appreciate you taking the time to dig into this with me. I'll sketch this all out as a next step. And I agree that we can leave the empty image (or None image) as a performance optimization for advanced users. The one thing we'll need to be careful of is that the LayoutLMv1 model gets upset if you do pass in the image (i.e. it's optional for v2/v3 but not for v1 -- v1 does not accept images at all). So if the workaround is to pass in an empty image, we'll just need to figure out a way to cleverly avoid handing it to the model (e.g. implement a no-op feature extractor that takes the image as input and returns an empty dict).

With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either, although I am leaning a bit towards the latter so that we can be very clear with the examples/documentation about the use cases (and not muddy the waters with the VisualQuestionAnsweringPipeline which operates directly on the image each time). But of course, I'm open either way.

Actually, wdyt about having a list of tuples instead of two lists ? Two lists enables having different sized lists which would silently break things, I usually tend to prefer arguments that cannot by design be inconsistent, and lists of tuples cannot have different sizes and will necessarily raise errors when the tuple is unwrapped, so less room for error_

I have no concerns with this. The runtime "perf hit" of converting one format to the other is trivial compared to the other operations involved. I think it's a smart way to prevent an accidental length mismatch.

I think all 3 examples could become tests so that we make sure that those cases are maintained through time.

Great point. I'm happy to contribute these.

Narsil · 2022-08-02T19:19:48Z

With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either,

Go with DocumentQuestionAnsweringPipeline for now then. In general we try to avoid adding pipelines when we can and when the set of input/output is the same as it makes discoverability and usability on hf.co easier/more consistent.

But you made great points explaining core differences (especially the pdf example for instance), IMO.
If we decide to revisit later or other members have different opinions, we might revisit later (we would do the lifting, and since we're committed to zero breaking change you would still be able to use your code regardless of internal decisions)

ankrgyl · 2022-08-02T19:34:19Z

Okay great, as a next step I'll rework this PR to sketch out DocumentQuestionAnsweringPipeline and address some of your comments on the original change (but may not do all in the first pass, just to optimize for getting feedback sooner). Thanks again for the back and forth and look forward to the next iteration!

ankrgyl · 2022-08-04T06:02:58Z

I just pushed an update that moves the logic into a new DocumentQuestionAnsweringPipeline. I still need to do a few major things:

Integrate OCR
Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the qa pipeline)
Integrate the post-processing from the QA pipeline.

I did some sanity testing with a model we've trained and can confirm that it is starting to work! I think we're headed in the right direction.

Narsil · 2022-08-04T15:03:23Z

Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the qa pipeline)

Not sure I understand, in the pipelines the padding should be done by the pipeline itself, not by the preprocess (It just allows for more flexible control over how things are executed). preprocess only processes 1 input at a time, so padding shouldn't be necessary (it might be activable, like truncating, but I don't think it should be the default)

ankrgyl · 2022-08-04T17:08:16Z

Not sure I understand, in the pipelines the padding should be done by the pipeline itself, not by the preprocess (It just allows for more flexible control over how things are executed). preprocess only processes 1 input at a time, so padding shouldn't be necessary (it might be activable, like truncating, but I don't think it should be the default)

If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.

I'm happy to do it that way if you prefer -- my working assumption was that the "padding" argument to the tokenizer accomplishes the same thing (but certainly may be missing some interesting implementation detail).

Narsil · 2022-08-05T12:40:32Z

If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.

Ok, this is what I alluded to, QA solves this by using return_overflowing_tokens (and the padding is set to do_no_pad by default).

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/question_answering.py#L283

QA solves this by using ChunkPipeline.
If you want to tackle this, you're more than welcome to it, but IMO it's going to be easier to do it in two steps, and two separate PRs.

As a first step I would recommend simply not treating padding, and let sequences too long go to to the model, which will then crash. It's aligned with the "don't hide" anything policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add an auto chunking in a follow UP PR.

ankrgyl · 2022-08-05T14:36:09Z

As a first step I would recommend simply not treating padding, and let sequences too long go to to the model, which will then crash. It's aligned with the "don't hide" anything policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add an auto chunking in a follow UP PR.

That plan works for me! I'll provide a new update shortly.

ankrgyl · 2022-08-05T15:32:37Z

@Narsil I just updated the PR with a few changes that remove the padding/striding stuff (for now) and add some docs. The next steps are to integrate OCR and then refactor/borrow the post-processing code from the QA pipeline. I'll keep working on that but wanted to post an intermediate update in case you had a chance to take a quick look.

ankrgyl · 2022-08-05T16:31:52Z

@Narsil another thought / question I had while working on the OCR stuff... Currently, both LayoutLMv2 and v3 have a feature extractor which by default applies OCR. By incorporating OCR into the pipeline itself (which I'm doing by just borrowing their code), we essentially take over that functionality. So, a user may have to do something like this:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", tokenizer="layoutlmv3-xxx", feature_extractor=AutoFeatureExtractor.from_pretrained("layoutlmv3-xxx", apply_ocr=False))

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Essentially, we'll want to rely on the pipeline's OCR, not the feature extractor's. However as a result, we make the user experience a bit awkward (since they have to provide "apply_ocr" False in one place). I can think of a few solutions to this:

We could rely on the user providing a feature extractor as input, and then invoke the feature extractor in preprocess(), essentially following the conventions that LayoutLMv2Processor/LayoutLMv3Processor do (call the feature extractor and then the tokenizer). If they provide neither a feature extractor nor words, we can provide a helpful error message that they must provide a feature extractor that returns words. One major downside to this approach is that users of models like LayoutLMv1 will not ever get OCR run for them by the pipeline, but I'm open to implementing a feature extractor for LayoutLMv1 to solve this.
If they provide a feature extractor, we could try to check whether it'll run OCR (e.g. by checking whether its "apply_ocr" attribute is True). If it will, then we can rely on the feature extractor to provide words/boxes. If not, and they haven't passed in words to the pipeline, then we can run OCR. I think the major downside is depending on a non-standard flag (apply_ocr) in the generic pipeline code. I'm not sure how you all think about this tradeoff -- it may be fine to do. A slight variant of this is to test whether after running the feature extractor, we have words and boxes available in its output.
We could just ignore this altogether and let the user be the expert. I.e. if they pass in a feature extractor and have not specified apply_ocr=False, it will run OCR twice (once in the pipeline and once in the feature extractor), which is an unnecessary perf hit, but makes no assumptions about the feature extractor itself.

Let me know your thoughts.

ankrgyl · 2022-08-06T23:45:53Z

@Narsil I think I've implemented all of what we talked about (and apologies in advance if I missed anything). To summarize:

Padding/truncation are gone. I've left them commented out, because we plan to address them as a follow up (in this or a fast-follow PR), but I'm happy to remove those comments too if you prefer.
OCR is integrated. Regarding my question just above, I went down the route of option 2, and check whether the feature extractor returned words/boxes before trying to run OCR, which the pipeline natively supports.
I refactored the tricky postprocessing parts of the QA pipeline into helper functions which I call from the document question answering pipeline.
I've copied the relevant subsets of the code (including PR Add LayoutLMForQuestionAnswering model #18407) and published it here with some examples. Feel free to play around with it!

As a next step, I'd appreciate a quick review from you on these major points to verify whether we're on the right track. I'd like to add the tests and more documentation next (pending your feedback on if we are in a good place with the interface/overall design). I also have a few questions regarding the tests/docs:

The existing tests for both question-answering and visual-question-answering use models published on HF. There aren't (currently that I could find) any reliablen doc qa models. I have published one here, but there's a bit of a soft dependency on PR Add LayoutLMForQuestionAnswering model #18407 because the model we're publishing uses LayoutLMv1. You can access the model w/ remote code enabled, but I'm not sure that's advisable for a test in the repo. It'd also be good to have tests that span multiple models (e.g. v1-v3) because there are some differences in their tokenizers.
Is there any way to use a processor in a pipeline? The reason I ask is that LayoutLMv2 and v3 have some interesting logic encapsulated in their processors (e.g. LayoutLMv2 renames the input to the model from image to pixel_values and v3 to image_features). It'd be great to reuse the logic in those classes within the pipeline. Alternatively, I could just support LayoutLMv1 to start with and we can work on adding support for the other versions in a follow up PR.
Should I add docs anywhere other than the code itself (which I assume would show up here)? For example a place like here) as a tutorial for how document question answering works?

ankrgyl · 2022-08-09T14:07:59Z

@Narsil gentle nudge in case this slipped from your queue :)

NielsRogge

So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?

Also, I'd like to note that there's a new model I'm working on called Donut which solved DocVQA in a generative manner. Donut is generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?

ankrgyl · 2022-08-09T14:43:42Z

So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?

Yes that's correct.

Also, I'd like to note that there's a new model I'm working on called Donut which solved DocVQA in a generative manner. Donut is generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?

The interface should support it. As input, you provide an image+question (and optional word/box pairs if you've pre-run OCR) and as output you receive an answer + start/end words. For a generative model, I could imagine either omitting the start/end or the pipeline doing its best to find it in the document if it exists.

Code-wise, there may be some refactoring within the pipeline implementation to best support a model like Donut. Very happy to collaborate with you on that.

ankrgyl · 2022-09-05T23:55:43Z

@NielsRogge @Narsil @sgugger @patrickvonplaten I think we are pretty much there -- I just updated to address the outstanding comments, and the failing tests look like they are flaky (unrelated to this change). Let me know if there's anything else needed on this PR.

sgugger

Thanks for adding this new pipeline. Left a couple of small comments and then it should be good to merge.

There a lot of followups/TODOs in this PR, might be worth opening an issue to not lose track of them.

sgugger · 2022-09-06T14:16:00Z

src/transformers/pipelines/__init__.py

+        "default": {
+            "model": {
+                "pt": ("impira/layoutlm-document-qa", "3a93017")
+            },  # TODO Update with custom pipeline removed, just before we land


Is this still relevant?

No, good catch, removed.

src/transformers/pipelines/document_question_answering.py

sgugger · 2022-09-06T14:18:05Z

src/transformers/pipelines/document_question_answering.py

+    def preprocess(
+        self,
+        input,
+        lang=None,
+        tesseract_config="",
+    ):


Suggested change

def preprocess(

self,

input,

lang=None,

tesseract_config="",

):

def preprocess(self, input, lang=None, tesseract_config=""):

Nit: fits in one line.

Fixed (seems like Black was keeping it on separate lines because of the trailing comma)

sgugger · 2022-09-06T14:19:09Z

src/transformers/pipelines/document_question_answering.py

+            )
+
+        if isinstance(self.model, VisionEncoderDecoderModel):
+            task_prompt = f'<s_docvqa><s_question>{input["question"]}</s_question><s_answer>'


Can we refine the test to be more specific to Donut then? Vision encoder decoder is a wide class.

sgugger · 2022-09-06T14:19:39Z

src/transformers/pipelines/document_question_answering.py

+            if "boxes" not in tokenizer_kwargs:
+                bbox = []
+                for batch_index in range(num_spans):
+                    for i, s, w in zip(


Can we use better names here?

sgugger · 2022-09-06T14:19:47Z

src/transformers/pipelines/document_question_answering.py

+        return {
+            **encoding,
+            "p_mask": p_mask,
+            "word_ids": word_ids,
+            "words": words,
+        }


Fits in one line.

sgugger · 2022-09-06T14:20:12Z

src/transformers/pipelines/document_question_answering.py

+    def postprocess_encoder_decoder(self, model_outputs, **kwargs):
+        # postprocess
+        sequence = self.tokenizer.batch_decode(model_outputs.sequences)[0]
+        sequence = sequence.replace(self.tokenizer.eos_token, "").replace(self.tokenizer.pad_token, "")


sgugger · 2022-09-06T14:20:33Z

src/transformers/pipelines/document_question_answering.py

+        )
+
+        word_ids = model_outputs["word_ids"][0]
+        for s, e, score in zip(starts, ends, scores):


Can we use better names for s (start) and e (end) here?

ankrgyl · 2022-09-06T14:27:34Z

Thanks @sgugger. Should I open one issue with all of them or a separate issue for each TODO? Happy to do that.

sgugger · 2022-09-06T14:35:13Z

I think one issue with the list of TODOs is fine.

ankrgyl · 2022-09-07T16:57:25Z

Here is the task with follow ups: #18926.

sgugger · 2022-09-07T17:39:00Z

Thanks again for your contribution!

* [WIP] Skeleton of VisualQuestionAnweringPipeline extended to support LayoutLM-like models * Fixup * Use the full encoding * Basic refactoring to DocumentQuestionAnsweringPipeline * Cleanup * Improve args, docs, and implement preprocessing * Integrate OCR * Refactor question_answering pipeline * Use refactored QA code in the document qa pipeline * Fix tests * Some small cleanups * Use a string type annotation for Image.Image * Update encoding with image features * Wire through the basic docs * Handle invalid response * Handle empty word_boxes properly * Docstring fix * Integrate Donut model * Fixup * Incorporate comments * Address comments * Initial incorporation of tests * Address Comments * Change assert to ValueError * Comments * Wrap `score` in float to make it JSON serializable * Incorporate AutoModeLForDocumentQuestionAnswering changes * Fixup * Rename postprocess function * Fix auto import * Applying comments * Improve docs * Remove extra assets and add copyright * Address comments Co-authored-by: Ankur Goyal <ankur@impira.com>

arun007chauhan · 2023-08-07T21:21:29Z

can you please help me,
i am stuck,
let me explain,
if there is a grocery list in an invoice, I want the entire list as answer

but I tried a lot, it only give 1st value only

ankrgyl changed the title ~~Extend VisualQuestionAnsweringPipeline to support QuestionAnwering models (e.g. LayoutLM)~~ [WIP] Extend VisualQuestionAnsweringPipeline to support QuestionAnwering models (e.g. LayoutLM) Aug 2, 2022

Narsil reviewed Aug 2, 2022

View reviewed changes

ankrgyl mentioned this pull request Aug 3, 2022

Add LayoutLMForQuestionAnswering model #18407

Merged

5 tasks

ankrgyl force-pushed the llm-qa branch from 2edd4ec to 626d4fa Compare August 4, 2022 06:01

ankrgyl force-pushed the llm-qa branch from 2ef062d to c26f216 Compare August 6, 2022 22:20

NielsRogge reviewed Aug 9, 2022

View reviewed changes

ankrgyl force-pushed the llm-qa branch 3 times, most recently from b6ef1bd to bc6ee38 Compare August 13, 2022 19:29

ankrgyl added 5 commits September 5, 2022 14:23

Wrap score in float to make it JSON serializable

23f6600

Incorporate AutoModeLForDocumentQuestionAnswering changes

0168f3a

Fixup

92f641b

Rename postprocess function

2fe7a8c

Fix auto import

a59fbd3

ankrgyl force-pushed the llm-qa branch 3 times, most recently from 7aaff82 to a3074c2 Compare September 5, 2022 22:37

Applying comments

6c94556

ankrgyl force-pushed the llm-qa branch from a3074c2 to 6c94556 Compare September 5, 2022 22:50

ankrgyl added 2 commits September 5, 2022 16:20

Improve docs

08193d3

Remove extra assets and add copyright

d271829

ankrgyl force-pushed the llm-qa branch from 61e2611 to d271829 Compare September 5, 2022 23:32

sgugger approved these changes Sep 6, 2022

View reviewed changes

ankrgyl force-pushed the llm-qa branch from fa8d2a4 to d24d2db Compare September 7, 2022 16:54

ankrgyl mentioned this pull request Sep 7, 2022

Follow ups to DocumentQuestionAnswering Pipeline #18926

Open

8 tasks

Address comments

2a2bf09

ankrgyl force-pushed the llm-qa branch from d24d2db to 2a2bf09 Compare September 7, 2022 16:57

sgugger merged commit 2ef7742 into huggingface:main Sep 7, 2022

ankrgyl mentioned this pull request Sep 26, 2022

Implement multiple span support for DocumentQuestionAnswering #19204

Merged

4 tasks

This was referenced Dec 29, 2022

Add document token classification pipeline loop-payments/transformers#1

Merged

Add document token classification pipeline (#1) #21012

Closed

cdreetz mentioned this pull request Sep 28, 2024

Add AudioQuestionAnswering pipeline #33782

Open

Add DocumentQuestionAnswering pipeline #18414

Add DocumentQuestionAnswering pipeline #18414

Conversation

ankrgyl commented Aug 2, 2022 • edited Loading

What does this PR do?

Before submitting

Who can review?

ankrgyl commented Aug 2, 2022

HuggingFaceDocBuilderDev commented Aug 2, 2022 • edited Loading

Narsil left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ankrgyl commented Aug 2, 2022 • edited Loading

Narsil commented Aug 2, 2022

ankrgyl commented Aug 2, 2022

Narsil commented Aug 2, 2022

ankrgyl commented Aug 2, 2022 • edited Loading

Scenario 1: Single file, single question

Scenario 2: Interactive REPL

Scenario 3: Mixed Media Types

Narsil commented Aug 2, 2022

ankrgyl commented Aug 2, 2022 • edited Loading

Narsil commented Aug 2, 2022

ankrgyl commented Aug 2, 2022

ankrgyl commented Aug 4, 2022 • edited Loading

Narsil commented Aug 4, 2022

ankrgyl commented Aug 4, 2022

Narsil commented Aug 5, 2022 • edited Loading

ankrgyl commented Aug 5, 2022

ankrgyl commented Aug 5, 2022

ankrgyl commented Aug 5, 2022 • edited Loading

ankrgyl commented Aug 6, 2022 • edited Loading

ankrgyl commented Aug 9, 2022

NielsRogge left a comment

Choose a reason for hiding this comment

ankrgyl commented Aug 9, 2022

ankrgyl commented Sep 5, 2022

sgugger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ankrgyl commented Sep 6, 2022

sgugger commented Sep 6, 2022

ankrgyl commented Sep 7, 2022

sgugger commented Sep 7, 2022

arun007chauhan commented Aug 7, 2023

ankrgyl commented Aug 2, 2022 •

edited

Loading

HuggingFaceDocBuilderDev commented Aug 2, 2022 •

edited

Loading

Narsil left a comment •

edited

Loading

ankrgyl commented Aug 2, 2022 •

edited

Loading

ankrgyl commented Aug 2, 2022 •

edited

Loading

ankrgyl commented Aug 2, 2022 •

edited

Loading

ankrgyl commented Aug 4, 2022 •

edited

Loading

Narsil commented Aug 5, 2022 •

edited

Loading

ankrgyl commented Aug 5, 2022 •

edited

Loading

ankrgyl commented Aug 6, 2022 •

edited

Loading