
Add DocumentQuestionAnswering pipeline #18414

Merged: 34 commits merged into huggingface:main on Sep 7, 2022

Conversation


@ankrgyl ankrgyl commented Aug 2, 2022

What does this PR do?

This PR extends VisualQuestionAnsweringPipeline to accept words and boxes as input, passes them into the tokenizer/model (along with the question), and post-processes the resulting QuestionAnsweringModelOutput.

Fixes #18380
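
For context, the pipeline added here ends up being callable roughly like this (a sketch based on the default checkpoint referenced later in this thread; it assumes pytesseract is installed for the OCR fallback, and the exact output fields may vary by version):

from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

# With only an image, the pipeline runs OCR itself (hence the pytesseract assumption);
# pre-computed OCR results can be supplied via the word_boxes argument instead.
result = doc_qa(image="invoice.png", question="What is the invoice number?")
print(result)  # a list of {"score", "answer", "start", "end"} dicts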

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil

@ankrgyl ankrgyl changed the title from "Extend VisualQuestionAnsweringPipeline to support QuestionAnswering models (e.g. LayoutLM)" to "[WIP] Extend VisualQuestionAnsweringPipeline to support QuestionAnswering models (e.g. LayoutLM)" on Aug 2, 2022

ankrgyl commented Aug 2, 2022

@Narsil this is basically a skeleton implementation that I thought I'd send out sooner rather than later to start getting your input.

I've left a few questions throughout tagged with "TODO" in the comments. The big question is how much/whether to reuse the code in QuestionAnsweringPipeline, which has a lot of overlap (notably preparing the spans and post-processing the output). For example, I could refactor out methods like QuestionAnsweringPipeline.decode to share the implementation, inherit from QuestionAnsweringPipeline, etc.


HuggingFaceDocBuilderDev commented Aug 2, 2022

The documentation is not available anymore as the PR was closed or merged.


@Narsil Narsil left a comment


Great PR despite my many comments (a lot are regarding the same items).

Thanks for this work!

The biggest things to make this PR easier to go through:

  • Don't do padding/truncating and let the model crash when it's too big. (It will require a new PR to support context splitting IMO)
  • Don't try to support inference without image, that should be a different pipeline.
  • Flat dicts at _forward boundary.
  • Add tests (both fast and slow, I can definitely help set them up)

Please let me know if anything is unclear or you're not sure how to fix something; I can definitely help go further and make code edits, you've done most of the work here.

For the is_split_into_words flag, don't worry about it. I tried to raise the concern, but it seems there's no really great solution anyway, so this will be good enough.

if (isinstance(image, (Image.Image, str)) or image is None) and isinstance(question, str):
    inputs = {"question": question}
    if image is not None:
        inputs["image"] = image
Contributor

I think the image should be mandatory for it to make sense in this pipeline.

How can the model do anything without an image?

Optional inputs are really not great, since they make pipeline usage sort of guesswork.
In pipelines, optional values are called parameters, and no parameter should be mandatory (so at least there's common ground for users to use a model with some sane defaults).

Maybe these models can implement both the qa and vqa pipelines then? (But qa also requires a context instead of an image.)

Contributor Author

I'll rework this per our discussion about the role of the image vs. words/boxes.

def preprocess(self, inputs, padding=False, truncation=False, words=[], boxes=[]):
    image = load_image(inputs["image"])
    model_inputs = self.tokenizer(
        inputs["question"], return_tensors=self.framework, padding=padding, truncation=truncation
Contributor

Never use lists in default arguments (=[]):
https://docs.python-guide.org/writing/gotchas/
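
A minimal illustration of the gotcha (generic Python, not code from this PR):

def append_bad(item, bucket=[]):      # the SAME list object is reused across calls
    bucket.append(item)
    return bucket

def append_good(item, bucket=None):   # the usual fix: default to None, create the list inside
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(append_bad(1), append_bad(2))    # [1, 2] [1, 2]  <- state leaks between calls
print(append_good(1), append_good(2))  # [1] [2]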

Also, words and boxes are NOT parameters; they should come from OCRing the image (IIUC).

Parameters are something that should always be possible to enable when instantiating the whole pipeline:

pipe = pipeline(...., arg=XX)
for result in pipe(dataset):
    print(result)

Here it doesn't seem like words or boxes make sense in this context. They should come from the data itself.
If they become actual inputs -> different pipeline (question answering maybe, but the names are going to be different, and no bounding boxes could be sent either for the same reason; maybe we could implement it without sending boxes? What even are boxes when there is no image?)
If they come from the image, everything seems fine.

Contributor Author

OMG how did I not know that! Thank you for pointing it out. I'll fix the default argument piece and rework these into optional arguments per our discussion below.

Comment on lines 164 to 176
if "words" in inputs:
# TODO: Can we refactor and share (or inherit from) the QuestionAnweringPipeline? I think we want to do
# the same thing with the tokenizer, except we have a few extra arguments (e.g. the bounding boxes).
extra_tokenizer_params = {
"return_token_type_ids": True,
"return_attention_mask": True,
"is_split_into_words": True,
# "max_length": 512,
}
padding = "max_length"
truncation = True
question = [question]
text_pair = inputs["words"]
Contributor

Actually, you shouldn't really override padding/truncation by default.

padding/truncation should be user controlled, never done automatically (it has performance implications, both quality-wise and speed-wise). Pipelines should never hide such details (it's done in some pipelines, but that's purely legacy). (We can provide some simple parameter arguments to work around those issues.)

Why did you choose to add these? There might be another way to do the same thing.

padding = "max_length"
truncation = True
question = [question]
text_pair = inputs["words"]
Contributor

is_split_into_words is really dangerous and should be avoided as much as possible. It bypasses the tokenizer's pre-tokenizer, which means the resulting tokens might be very different from what the model is supposed to see. It's really only used in corner cases where there is no other way.

Let's say your OCR detects "New York" as a single box, and the tokenizer is word-level, knows only "New" and "york", and uses a space-splitting pre_tokenizer; then using is_split_into_words means the model is going to receive unk (since it cannot split on the space) instead of ["new", "york"].

If I understand correctly, you're trying to work around the fact that you cannot send a list of strings (the boxes found in the image) directly to the tokenizer (there's only text + text_pair).

Here we can probably work around that by using a single string (instead of a list) and dealing with offsets instead.
Another option would be to do the post_processor "manually" (directly using the underlying tokenizer).

Both options are not great, but they would avoid the caveats implied by is_split_into_words, which IMO is a real danger here. (Most tokenizers have no concept of words, and depending on the OCR's actual chunking it can cause issues as shown; the first option's issue is that the offset boundaries might not fall on the boxes.)

Since this is more involved, we can keep it as-is for this PR, with a comment noting the caveat, and eventually fix it later.
Maybe tokenizers can make the fix even simpler by accepting lists of strings instead of just one string or a pair.
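
To see the caveat concretely, here is a rough way to compare the two behaviors with a specific tokenizer (an illustrative sketch only; the exact tokens depend on the tokenizer's pre-tokenizer and vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Treating the OCR box "New York" as one pre-split word bypasses the pre-tokenizer...
pretokenized = tok(["New York"], is_split_into_words=True)
# ...whereas passing the raw string lets the tokenizer split it itself.
raw = tok("New York")

print(tok.convert_ids_to_tokens(pretokenized["input_ids"]))
print(tok.convert_ids_to_tokens(raw["input_ids"]))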

Contributor Author

Okay yeah I totally agree with you. The reason I pass this argument in is that the tokenizer complains if I try to pass in an (unsplit) question and a (split) text_pair of words, and for this use case, we don't need to index within the question to produce an answer (accuracy/performance issues aside).

I believe the LayoutLMv3Tokenizer gets around this limitation by tokenizing the question separately from the words in this case. I appreciate you letting us keep it as is for this PR and then we can work on a better solution (which I'm happy to help contribute).

for i, s, w in zip(
    encoding.input_ids[batch_index],
    encoding.sequence_ids(batch_index),
    encoding.word_ids(batch_index),
Contributor

Here you use the word ids basically as box ids, and the sequence_id as a flag to know whether the token is in the context or not. Right?

If yes, that's just a confirmation that there's a problem with the current code (totally understandable, you're working around limitations), because the "words" are not really words, but chunks of text in an image.

Again no real change needed, just emphasizing that this behavior will rely on the OCR and tokenizer working in specific ways.

Contributor Author

Yes. You could read this code colloquially as "if the box belongs to a word in the doc (i.e. sequence id is 1) then assign it a bounding box based on its original (word id) index. If the box belongs to a separator token, assign it [1000,1000,1000,1000]. Otherwise, assign it to [0,0,0,0]".

The challenge you're raising (words vs. boxes in the image) is a very real challenge for document processing tools. Most do not support "within-box" predictions. Definitely a problem worth visiting in the future.
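
Spelled out, that colloquial reading amounts to something like the sketch below (names are illustrative and assume a fast-tokenizer encoding plus OCR boxes aligned with the OCR words; the merged code may differ in details):

def boxes_per_token(encoding, boxes, batch_index, sep_token_id):
    # Assign one bounding box per token, following the description above.
    bbox = []
    for input_id, sequence_id, word_id in zip(
        encoding.input_ids[batch_index],
        encoding.sequence_ids(batch_index),
        encoding.word_ids(batch_index),
    ):
        if sequence_id == 1:            # the token comes from a document word
            bbox.append(boxes[word_id])
        elif input_id == sep_token_id:  # separator token
            bbox.append([1000] * 4)
        else:                           # question tokens, other special tokens, padding
            bbox.append([0] * 4)
    return bbox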


# TODO: This is a very poor implementation of start/end (just here for completeness sake).
# Ideally we can refactor/borrow the implementation in the question answering pipeline.
results = []
Contributor

I really like the idea of reusing qa implementation here (especially since it's really non-trivial).

The easiest thing to do should be to extract the qa's code into a function and reuse it here if it's exactly the same.

Contributor Author

I'll do that!

def __call__(self, image: Union["Image.Image", str], question: str = None, **kwargs):
def __call__(
    self,
    image: [Optional[Union["Image.Image", str]]] = None,
Contributor

This needs to be set at call time. It's a visual question answering so it does require an image.

Contributor Author

Will adjust per our discussion below. I think what we should do is make image a required argument, and then (without making it too obvious), allow an advanced user to pass None (or another empty-like value) and handle it gracefully within the implementation.

def __call__(
    self,
    image: [Optional[Union["Image.Image", str]]] = None,
    question: Optional[Union[List[str], str]] = None,
Contributor

This None here exists solely to support pipeline({"image": image, "question": question}) (which is necessary when you want to work with datasets or lists directly).

It's not documented, kind of on purpose; the documentation body showcases the various uses, but here the strict typing just makes things less readable IMO, since if you do send it, it should be a str.

Contributor

I read the documentation below and the List options for question and image are actually wrong and shouldn't be there.

Contributor Author

Okay cool. I'll just revert this since we'll place things in a new pipeline anyway.

Contributor

Feel free to do the typing in the new pipeline. The most critical piece is the docstring (and this is the one where readability > completeness)

Comment on lines 236 to 239
if "logits" in model_outputs:
return postprocess_sequence_output(self.model, model_outputs, self.framework, top_k)
elif "start_logits" in model_outputs and "end_logits" in model_outputs:
return postprocess_qa_output(
Contributor

It's OK to do it this way.

This is directly linked to the ModelForXX api to return logits or start_logits.

Contributor Author

Okay cool, thanks for the clarification.

scores = scores.tolist()
ids = ids.tolist()
return [{"score": score, "answer": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
assert False, "Unknown output format"
Contributor

Again please raise real exceptions.
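
For example (a generic sketch of the requested change, not the exact merged code):

# Instead of: assert False, "Unknown output format"
raise ValueError(f"Unknown output format: {list(model_outputs.keys())}")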

Contributor Author

Will fix. These were meant to be temporary as I was sketching out the code.

Contributor

No worries.


ankrgyl commented Aug 2, 2022

@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.

The reason (I think) It's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.

Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.

I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.

Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. For example, another option could be to add a new pipeline called DocumentQuestionAnswering (or similar) that handles inputs of this shape. Let me know your thoughts.


Narsil commented Aug 2, 2022

@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not) image argument at a high level.

Yes very much so !
I think it's great to have a clear conversation about this.

I will try and convey this part of the library's perspective, but having your view is great too since I probably know less about OCRs and overall document processing than you.

The reason (I think) It's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.

This is very interesting to know !
I was under the impression that using an OCR could be streamlined much more.

The fact that the OCR has much impact on the quality of the results doesn't surprise me (and the is_split_into_words might play a non negligible role here)

Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into context), you actually lose some valuable information about how the words are separated in the document (including the distance between them). In other words -- it's very important that you retain information about which words correspond to which coordinates.

Yes, I felt the same thing when reading your code and pondering whether it should actually belong in QA instead of VQA. I think you are right, the image information is too valuable to be thrown away.

I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementation of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. try to reinvent them elsewhere.

Very reasonable ! :)

Hopefully that makes sense and adds some context for why I proposed making image optional. I'm very open to alternate solutions too, but just wanted to clarify the use case a bit. Let me know your thoughts.

Ok, I am going to recap the main goal of the pipeline:

A pipeline is a tool to make ML accessible to non ML practitioners.

That's the first goal, and in doing that, we don't want to hide any ML details that might hurt users unknowingly (like chunking things in a way that can hurt output quality without being opt-in). So hide as many things as possible when the defaults are correct, but don't hide any magic that could be use-case dependent. For instance, truncating without asking for explicit user consent (via a parameter) means the user will try to send large chunks of text and get an output that corresponds to only a tiny chunk of it, without realizing it.

Another secondary goal is to make them as reusable/extendable as possible, but only when it doesn't contradict any of the previous goals.

With that in mind, you can see why having inputs/outputs that depend on the actual model type forces non-ML practitioners to know about model types, where the goal is to lift that burden. If we can ensure that sending the same input yields the same kind of output, users can jump very easily between models. So when AwesomeModelA comes out, you can just swap its name and make it work. The same goes for iterations/fine-tuning of the same model, different models, and so on.

Here I think I can see two solutions:

1/ We create a new pipeline (DocumentQuestionAnsweringPipeline?). The set of I/O is different, so we should have different pipelines for these. For this pipeline it seems the input is boxes + words (which I would personally call texts, as OCRs probably extract full strings and don't necessarily reason about words). It's easy, but puts all the burden of the OCR on the user upfront. (If the OCR choice is super tricky and we cannot realistically make that choice in a general fashion for users, it's probably the way to go.)

2/ We keep using VisualQuestionAnswering but we enable a very easy way to use a custom OCR:

  • Most users will trigger an initial error that pytesseract (or something else) is not present and get suggested to install it to get an easy impression about results (mention all the caveats/link to some docs on how to choose the OCR for advanced users).
  • When those sane defaults are present, the pipelines will use those.
  • For experienced users who know how deeply the OCR can impact the results, we can enable easy overriding like:
pipe = pipeline("mymodel-id", ocr=MyOCR())

class MyOCR:
    def forward(self, image):
       ...
       return texts, boxes

What do you think? Which solution makes most sense from your perspective?

Also regardless of choice here, we can extract whatever makes sense as an individual function within qa so you can reuse it, in a pipeline or anywhere else.


ankrgyl commented Aug 2, 2022

For Option 1, to clarify, would you be open to allowing the user to pass in (optional) words and boxes? I think this is conceptually similar to your point about audio pipelines using ffmpeg but I may be misunderstanding something. Essentially, we'd run OCR on the image if the words/boxes are not passed in. And either way, depending on the model, pass the image into the model as well. If we made the words/boxes an optional input, then users could basically assert control where they'd like to, but the simple use cases will just work out of the box.

Personally, I think option 1 is the way to go. I can at least sketch out the code as a next step and we can take a look and reevaluate.


Narsil commented Aug 2, 2022

would you be open to allowing the user to pass in (optional) words and boxes

I have the same uneasiness with any optional inputs. Either the pipeline needs the data or it doesn't. IMO the incoming data should be as strongly typed as possible, and the computation definitely should not depend on what the user actually sent (because then it becomes really hard to reason about what actually happened to a piece of data: which OCR was used? Were the boxes correct? etc.).

I feel like I am missing a piece of the puzzle here, so maybe we can go the other way around: let's try to devise what we would actually like to write as a user for this document processing.

IMO the simplest is something like:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Or maybe be a little more strict:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
# ValueError : This model is a document processing model, and requires an OCR to be able to use this pipeline,
# please pass an OCR. For demos, you can use `from transformers.pipelines import DefaultOCR`

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", ocr=DefaultOCR())

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}, ...]


ankrgyl commented Aug 2, 2022

Ahh, okay, yes I agree that working from these examples is really helpful. Let me first be precise about what is required vs. not:

  • In all LayoutLM models, words and bounding boxes are technically required. The model itself requires them to be formatted a certain way (e.g. box coordinates are axis-aligned and normalized to 0->1000; a normalization sketch follows this list), but it does not impose where they came from. The inspiration is something like "BERT + bounding boxes".
  • In LayoutLMv2 and v3, the models additionally accept an image (normalized to 224x224) as input. Theoretically, the model is able to use information from the image alongside the encoded words and boxes. Notably, in LayoutLMv1, you do not need to provide the image. And furthermore, you can fine-tune v2 and v3 for many use cases without the additional image and achieve similar or in some cases better results.
  • The LayoutLMv2 and LayoutLMv3 processor classes in transformers optionally accept an apply_ocr argument. If set to True, then while doing feature extraction from the image they'll also use the tesseract library to run OCR and return the words and boxes back to the caller, so you can pass them into the model. There is some tricky control flow throughout these classes that branches based on whether the user provides their own OCR or not.
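
For reference, the 0-1000 normalization mentioned in the first bullet boils down to something like this (a minimal sketch mirroring the convention the LayoutLM feature extractors use; the exact helper in transformers may differ):

def normalize_box(box, width, height):
    # box is (x0, y0, x1, y1) in pixel coordinates of the original page image
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

print(normalize_box((100, 50, 400, 80), width=800, height=1000))  # [125, 50, 500, 80]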

I think part of why it's structured this way, or at least one of the advantages, is that in practice, since OCR can be costly (time and $$), many document processing practitioners will run OCR as a pre-processing step, so you can reuse its results across many invocations of extractions/questions/etc. E.g. imagine you were building an app that lets you point at a folder on your computer and then ask the files questions interactively. You'd probably implement this app by first running OCR on each file, and then re-using the OCR each time a user provides a new question as input.

I think with this in mind, there are probably a few different use cases that would be ideal to capture in the pipeline. I fully recognize that some of these may qualify as "more advanced" than the scope of a pipeline, so I'm open to and appreciate your push back on where that may be the case.

Scenario 1: Single file, single question

(your example above)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Scenario 2: Interactive REPL

(this is an approximation of a real-world use case)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")

img = Image.load("id_card.jpg")
words, boxes = my_favorite_ocr(img)
while True:
  question = input("Ask a question of the image: ")
  print(pipe(image=img, question=question, words=words, boxes=boxes))

Scenario 3: Mixed Media Types

img = rasterize("my_tax_form.pdf")
words, boxes = text_extract("my_tax_form.pdf")

# NOTE: in certain models, e.g. LayoutLMv1, you do not even need to rasterize/pass in the image as input in this case
out = pipe(image=img, question="What is the person's income?", words=words, boxes=boxes)
# out == [{answer: "$10", score:0.8}, {"answer": "$1000", "score": 0.1}]

I can certainly imagine some alternatives:

  • Words/boxes could be required inputs, and we could simply enforce that the user run OCR (or alternative) before using the pipeline. I think in this case, the image should be considered optional input, simply because certain document processing models take it as input, and others don't.
  • Another would be to allow the user to provide a more advanced "OCR" input that could accept things like PDFs, spreadsheets, etc. and let it call out to OCR or use something else depending on the media type. I would say from experience, handling various document types is a can of worms and it prevents you from reusing pre-processed results across calls to the pipeline. (I believe this is your second suggestion).
  • My original suggestion: words/boxes could be optional, and when not provided, we use a default OCR implementation. One more advantage of this approach is that it's consistent with the LayoutLMv2 processor classes. So if a user starts with this pipeline, and then wants to dig one level deeper to the processor, they'll have a familiar pattern.

Let me know your thoughts. In certain options (namely the first), I think it'd be a requirement for it to be a DocumentQuestionAnsweringPipeline since the required inputs are different than the VisualQuestionAnsweringPipeline. In options 2 or 3, that might not be the case. I don't have a strong opinion about this but just wanted to clarify my understanding/thinking.


Narsil commented Aug 2, 2022

Ok, thanks for all the explanation !

Now I think I am starting to understand it and all the use cases you displayed really make sense !

I think we can ignore layoutlmv1 not requiring the image so we can keep the number of cases rather small. (If you really know what you're doing you could always send an empty image, or we could just make the code in such a way that sending None doesn't break anything without actively trying to sanitize it)

Since the OCR is indeed quite costly (or the text can come from a non-image!), I can really understand why we would need those optional boxes and texts. So let's support them. (We can make the docs extremely clear on that front.)

I think example 1 should really be the focus for newcomers, and we need to support examples 2 and 3 to be usable in prod.

And if a user sends boxes + texts, then we can simply skip the OCR part.

Actually, wdyt about having a list of tuples instead of two lists? Two lists allow different-sized lists, which would silently break things. I usually tend to prefer arguments that cannot be inconsistent by design; a list of tuples cannot have mismatched lengths and will necessarily raise an error when a tuple is unpacked, so there's less room for error.

I think all 3 examples could become tests so that we make sure that those cases are maintained through time.

I will ping @NielsRogge, who is also really involved in vision and might have other insights.


ankrgyl commented Aug 2, 2022

Awesome, I really appreciate you taking the time to dig into this with me. I'll sketch this all out as a next step. And I agree that we can leave the empty image (or None image) as a performance optimization for advanced users. The one thing we'll need to be careful of is that the LayoutLMv1 model gets upset if you do pass in the image (i.e. it's optional for v2/v3 but not for v1 -- v1 does not accept images at all). So if the workaround is to pass in an empty image, we'll just need to figure out a way to cleverly avoid handing it to the model (e.g. implement a no-op feature extractor that takes the image as input and returns an empty dict).
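
For illustration, such a no-op feature extractor could be as small as the sketch below (purely hypothetical, not part of this PR):

class NoOpFeatureExtractor:
    # Hypothetical stand-in: accepts an image but contributes nothing to the model inputs,
    # so a LayoutLMv1-style model never receives pixel values.
    def __call__(self, images, **kwargs):
        return {}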

With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either, although I am leaning a bit towards the latter so that we can be very clear with the examples/documentation about the use cases (and not muddy the waters with the VisualQuestionAnsweringPipeline which operates directly on the image each time). But of course, I'm open either way.

Actually, wdyt about having a list of tuples instead of two lists? Two lists allow different-sized lists, which would silently break things. I usually tend to prefer arguments that cannot be inconsistent by design; a list of tuples cannot have mismatched lengths and will necessarily raise an error when a tuple is unpacked, so there's less room for error.

I have no concerns with this. The runtime "perf hit" of converting one format to the other is trivial compared to the other operations involved. I think it's a smart way to prevent an accidental length mismatch.
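
Concretely, the single-argument shape could look like this (a small sketch; the word_boxes name is illustrative):

word_boxes = [
    ("Total", (50, 400, 120, 420)),
    ("$10", (130, 400, 170, 420)),
]

# Unpacking fails immediately if a tuple is malformed, so the two halves can never drift apart.
words, boxes = map(list, zip(*word_boxes))
print(words)  # ['Total', '$10']
print(boxes)  # [(50, 400, 120, 420), (130, 400, 170, 420)]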

I think all 3 examples could become tests so that we make sure that those cases are maintained through time.

Great point. I'm happy to contribute these.


Narsil commented Aug 2, 2022

With all of this context in mind, do you have a preference for whether we extend the existing VisualQuestionAnsweringPipeline or isolate this logic into a DocumentQuestionAnsweringPipeline? I'm okay with either,

Go with DocumentQuestionAnsweringPipeline for now then. In general we try to avoid adding pipelines when we can, and when the set of inputs/outputs is the same, as it makes discoverability and usability on hf.co easier/more consistent.

But you made great points explaining the core differences (especially the pdf example), IMO.
If other members have different opinions, we might revisit later (we would do the lifting, and since we're committed to zero breaking changes you would still be able to use your code regardless of internal decisions).


ankrgyl commented Aug 2, 2022

Okay great, as a next step I'll rework this PR to sketch out DocumentQuestionAnsweringPipeline and address some of your comments on the original change (but may not do all in the first pass, just to optimize for getting feedback sooner). Thanks again for the back and forth and look forward to the next iteration!


ankrgyl commented Aug 4, 2022

I just pushed an update that moves the logic into a new DocumentQuestionAnsweringPipeline. I still need to do a few major things:

  • Integrate OCR
  • Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the qa pipeline)
  • Integrate the post-processing from the QA pipeline.

I did some sanity testing with a model we've trained and can confirm that it is starting to work! I think we're headed in the right direction.


Narsil commented Aug 4, 2022

Figure out padding (specifically -- using "return_tensors" basically requires padding, so I could either enforce it or do the unsqueeze trick used in the qa pipeline)

Not sure I understand; in pipelines, the padding should be done by the pipeline itself, not by preprocess (that just allows more flexible control over how things are executed). preprocess only processes one input at a time, so padding shouldn't be necessary (it might be activatable, like truncation, but I don't think it should be the default).


ankrgyl commented Aug 4, 2022

Not sure I understand; in pipelines, the padding should be done by the pipeline itself, not by preprocess (that just allows more flexible control over how things are executed). preprocess only processes one input at a time, so padding shouldn't be necessary (it might be activatable, like truncation, but I don't think it should be the default).

If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result in multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.

I'm happy to do it that way if you prefer -- my working assumption was that the "padding" argument to the tokenizer accomplishes the same thing (but certainly may be missing some interesting implementation detail).


Narsil commented Aug 5, 2022

If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result in multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from tokenize(), and then (b) while constructing the final output, using tensor.unsqueeze(0) (here) to effectively pad each element to the same size.

Ok, this is what I alluded to: QA solves this by using return_overflowing_tokens (and the padding is set to do_not_pad by default).

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/question_answering.py#L283

QA solves this by using ChunkPipeline.
If you want to tackle this, you're more than welcome to it, but IMO it's going to be easier to do it in two steps, and two separate PRs.

As a first step I would recommend simply not treating padding, and letting sequences that are too long go to the model, which will then crash. It's aligned with the "don't hide anything" policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add auto chunking in a follow-up PR.
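
For reference, the overflow handling QA relies on looks roughly like this at the tokenizer level (a sketch; the max_length/stride values are illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
enc = tok(
    "What is the total?",           # question
    "a very long document " * 400,  # context longer than the model's window
    truncation="only_second",
    max_length=384,
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
print(len(enc["input_ids"]))              # several overlapping spans instead of one truncated input
print(enc["overflow_to_sample_mapping"])  # maps each span back to its original example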


ankrgyl commented Aug 5, 2022

As a first step I would recommend simply not treating padding, and letting sequences that are too long go to the model, which will then crash. It's aligned with the "don't hide anything" policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add auto chunking in a follow-up PR.

That plan works for me! I'll provide a new update shortly.


ankrgyl commented Aug 5, 2022

@Narsil I just updated the PR with a few changes that remove the padding/striding stuff (for now) and add some docs. The next steps are to integrate OCR and then refactor/borrow the post-processing code from the QA pipeline. I'll keep working on that but wanted to post an intermediate update in case you had a chance to take a quick look.


ankrgyl commented Aug 5, 2022

@Narsil another thought / question I had while working on the OCR stuff... Currently, both LayoutLMv2 and v3 have a feature extractor which by default applies OCR. By incorporating OCR into the pipeline itself (which I'm doing by just borrowing their code), we essentially take over that functionality. So, a user may have to do something like this:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", tokenizer="layoutlmv3-xxx", feature_extractor=AutoFeatureExtractor.from_pretrained("layoutlmv3-xxx", apply_ocr=False))

out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Essentially, we'll want to rely on the pipeline's OCR, not the feature extractor's. However, as a result we make the user experience a bit awkward (since they have to pass apply_ocr=False in one place). I can think of a few solutions to this:

  1. We could rely on the user providing a feature extractor as input, and then invoke the feature extractor in preprocess(), essentially following the conventions that LayoutLMv2Processor/LayoutLMv3Processor do (call the feature extractor and then the tokenizer). If they provide neither a feature extractor nor words, we can provide a helpful error message that they must provide a feature extractor that returns words. One major downside to this approach is that users of models like LayoutLMv1 will not ever get OCR run for them by the pipeline, but I'm open to implementing a feature extractor for LayoutLMv1 to solve this.
  2. If they provide a feature extractor, we could try to check whether it'll run OCR (e.g. by checking whether its apply_ocr attribute is True); see the sketch after this list. If it will, then we can rely on the feature extractor to provide words/boxes. If not, and they haven't passed words into the pipeline, then we can run OCR. I think the major downside is depending on a non-standard flag (apply_ocr) in the generic pipeline code. I'm not sure how you all think about this tradeoff -- it may be fine to do. A slight variant of this is to test whether, after running the feature extractor, we have words and boxes available in its output.
  3. We could just ignore this altogether and let the user be the expert. I.e. if they pass in a feature extractor and have not specified apply_ocr=False, it will run OCR twice (once in the pipeline and once in the feature extractor), which is an unnecessary perf hit, but makes no assumptions about the feature extractor itself.
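
As mentioned in option 2, the check could look something like the sketch below (a hypothetical helper; run_pipeline_ocr stands in for whatever OCR the pipeline bundles):

def resolve_words_and_boxes(feature_extractor, image, features, run_pipeline_ocr):
    # Prefer OCR results produced by the feature extractor; otherwise run the pipeline's own OCR.
    if getattr(feature_extractor, "apply_ocr", False):
        # The LayoutLMv2/v3 feature extractors return "words" and "boxes" when apply_ocr=True.
        return features["words"], features["boxes"]
    return run_pipeline_ocr(image)  # e.g. a pytesseract-based fallback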

Let me know your thoughts.


ankrgyl commented Aug 6, 2022

@Narsil I think I've implemented all of what we talked about (and apologies in advance if I missed anything). To summarize:

  • Padding/truncation are gone. I've left them commented out, because we plan to address them as a follow up (in this or a fast-follow PR), but I'm happy to remove those comments too if you prefer.
  • OCR is integrated. Regarding my question just above, I went down the route of option 2: the pipeline checks whether the feature extractor returned words/boxes before falling back to the OCR it natively supports.
  • I refactored the tricky postprocessing parts of the QA pipeline into helper functions which I call from the document question answering pipeline.
  • I've copied the relevant subsets of the code (including PR Add LayoutLMForQuestionAnswering model #18407) and published it here with some examples. Feel free to play around with it!

As a next step, I'd appreciate a quick review from you on these major points to verify whether we're on the right track. I'd like to add the tests and more documentation next (pending your feedback on if we are in a good place with the interface/overall design). I also have a few questions regarding the tests/docs:

  • The existing tests for both question-answering and visual-question-answering use models published on HF. There aren't (currently, that I could find) any reliable doc qa models. I have published one here, but there's a bit of a soft dependency on PR Add LayoutLMForQuestionAnswering model #18407 because the model we're publishing uses LayoutLMv1. You can access the model w/ remote code enabled, but I'm not sure that's advisable for a test in the repo. It'd also be good to have tests that span multiple models (e.g. v1-v3) because there are some differences in their tokenizers.
  • Is there any way to use a processor in a pipeline? The reason I ask is that LayoutLMv2 and v3 have some interesting logic encapsulated in their processors (e.g. LayoutLMv2 renames the input to the model from image to pixel_values and v3 to image_features). It'd be great to reuse the logic in those classes within the pipeline. Alternatively, I could just support LayoutLMv1 to start with and we can work on adding support for the other versions in a follow up PR.
  • Should I add docs anywhere other than the code itself (which I assume would show up here)? For example, a place like here, as a tutorial for how document question answering works?


ankrgyl commented Aug 9, 2022

@Narsil gentle nudge in case this slipped from your queue :)


@NielsRogge NielsRogge left a comment


So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?

Also, I'd like to note that there's a new model I'm working on called Donut which solved DocVQA in a generative manner. Donut is a generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?


ankrgyl commented Aug 9, 2022

So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?

Yes that's correct.

Also, I'd like to note that there's a new model I'm working on called Donut which solved DocVQA in a generative manner. Donut is a generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?

The interface should support it. As input, you provide an image+question (and optional word/box pairs if you've pre-run OCR) and as output you receive an answer + start/end words. For a generative model, I could imagine either omitting the start/end or the pipeline doing its best to find it in the document if it exists.

Code-wise, there may be some refactoring within the pipeline implementation to best support a model like Donut. Very happy to collaborate with you on that.

@ankrgyl ankrgyl force-pushed the llm-qa branch 3 times, most recently from b6ef1bd to bc6ee38 on August 13, 2022 19:29
@ankrgyl ankrgyl force-pushed the llm-qa branch 3 times, most recently from 7aaff82 to a3074c2 on September 5, 2022 22:37

ankrgyl commented Sep 5, 2022

@NielsRogge @Narsil @sgugger @patrickvonplaten I think we are pretty much there -- I just updated to address the outstanding comments, and the failing tests look like they are flaky (unrelated to this change). Let me know if there's anything else needed on this PR.


@sgugger sgugger left a comment


Thanks for adding this new pipeline. Left a couple of small comments and then it should be good to merge.

There are a lot of follow-ups/TODOs in this PR; it might be worth opening an issue so we don't lose track of them.

"default": {
"model": {
"pt": ("impira/layoutlm-document-qa", "3a93017")
}, # TODO Update with custom pipeline removed, just before we land
Collaborator

Is this still relevant?

Contributor Author

No, good catch, removed.

Comment on lines 243 to 248
def preprocess(
    self,
    input,
    lang=None,
    tesseract_config="",
):
Collaborator

Suggested change
def preprocess(
    self,
    input,
    lang=None,
    tesseract_config="",
):
def preprocess(self, input, lang=None, tesseract_config=""):

Nit: fits in one line.

Contributor Author

Fixed (seems like Black was keeping it on separate lines because of the trailing comma)

)

if isinstance(self.model, VisionEncoderDecoderModel):
    task_prompt = f'<s_docvqa><s_question>{input["question"]}</s_question><s_answer>'
Collaborator

Can we refine the test to be more specific to Donut then? Vision encoder decoder is a wide class.

if "boxes" not in tokenizer_kwargs:
bbox = []
for batch_index in range(num_spans):
for i, s, w in zip(
Collaborator

Can we use better names here?

Contributor Author

Fixed

Comment on lines 364 to 369
return {
    **encoding,
    "p_mask": p_mask,
    "word_ids": word_ids,
    "words": words,
}
Collaborator

Fits in one line.

Contributor Author

Fixed

def postprocess_encoder_decoder(self, model_outputs, **kwargs):
    # postprocess
    sequence = self.tokenizer.batch_decode(model_outputs.sequences)[0]
    sequence = sequence.replace(self.tokenizer.eos_token, "").replace(self.tokenizer.pad_token, "")
Collaborator

+1

)

word_ids = model_outputs["word_ids"][0]
for s, e, score in zip(starts, ends, scores):
Collaborator

Can we use better names for s (start) and e (end) here?

Contributor Author

Yes fixed.


ankrgyl commented Sep 6, 2022

Thanks @sgugger. Should I open one issue with all of them or a separate issue for each TODO? Happy to do that.


sgugger commented Sep 6, 2022

I think one issue with the list of TODOs is fine.


ankrgyl commented Sep 7, 2022

Here is the task with follow ups: #18926.

@sgugger sgugger merged commit 2ef7742 into huggingface:main Sep 7, 2022

sgugger commented Sep 7, 2022

Thanks again for your contribution!

oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
* [WIP] Skeleton of VisualQuestionAnweringPipeline extended to support LayoutLM-like models

* Fixup

* Use the full encoding

* Basic refactoring to DocumentQuestionAnsweringPipeline

* Cleanup

* Improve args, docs, and implement preprocessing

* Integrate OCR

* Refactor question_answering pipeline

* Use refactored QA code in the document qa pipeline

* Fix tests

* Some small cleanups

* Use a string type annotation for Image.Image

* Update encoding with image features

* Wire through the basic docs

* Handle invalid response

* Handle empty word_boxes properly

* Docstring fix

* Integrate Donut model

* Fixup

* Incorporate comments

* Address comments

* Initial incorporation of tests

* Address Comments

* Change assert to ValueError

* Comments

* Wrap `score` in float to make it JSON serializable

* Incorporate AutoModelForDocumentQuestionAnswering changes

* Fixup

* Rename postprocess function

* Fix auto import

* Applying comments

* Improve docs

* Remove extra assets and add copyright

* Address comments

Co-authored-by: Ankur Goyal <ankur@impira.com>
@arun007chauhan

Can you please help me, I am stuck. Let me explain: if there is a grocery list in an invoice, I want the entire list as the answer.

But I tried a lot, and it only gives the first value.
