Add DocumentQuestionAnswering pipeline #18414
Conversation
@Narsil this is basically a skeleton implementation that I thought I'd send out sooner rather than later to start getting your input. I've left a few questions throughout, tagged with "TODO" in the comments. The big question is how much/whether to reuse the code in QuestionAnsweringPipeline, which has a lot of overlap (notably preparing the spans and post-processing the output). For example, I could refactor out methods like
The documentation is not available anymore as the PR was closed or merged.
Great PR despite my many comments (a lot are regarding the same items).
Thanks for this work!
The biggest things to make this PR easier to go through:
- Don't do padding/truncating and let the model crash when it's too big. (It will require a new PR to support context splitting IMO)
- Don't try to support inference without image, that should be a different pipeline.
- Flat dicts at the _forward boundary.
- Add tests (both fast and slow, I can definitely help set them up).
Please let me know if anything is unclear or you're not sure how to fix something; I can definitely help go further and make code edits, you've done most of the work here.
For the is_split_into_words flag, don't worry about it. I tried to raise the concern, but there's no really great solution anyway it seems, so this will be good enough.
if (isinstance(image, (Image.Image, str)) or image is None) and isinstance(question, str):
    inputs = {"question": question}
    if image is not None:
        inputs["image"] = image
I think the image should be mandatory for it to make sense in this pipeline.
How can the model do anything without an image?
Optional inputs are really not great, since they make pipeline usage sort of guesswork.
Optional inputs are called parameters, and none should be mandatory (so at least there's common ground for users to use a model with some sane defaults).
Maybe these models can implement both the qa and vqa pipelines then? (But qa also requires a context instead of an image.)
I'll rework this per our discussion about the role of the image vs. words/boxes.
image = load_image(inputs["image"])
model_inputs = self.tokenizer(
    inputs["question"], return_tensors=self.framework, padding=padding, truncation=truncation
def preprocess(self, inputs, padding=False, truncation=False, words=[], boxes=[]):
Never use lists in default arguments (=[]):
https://docs.python-guide.org/writing/gotchas/
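A minimal illustration of the gotcha (the function names here are just for illustration, not from the PR):

def preprocess(inputs, words=[]):  # the [] is created once, at function definition time
    words.append("leaked")
    return words

preprocess({})  # ['leaked']
preprocess({})  # ['leaked', 'leaked'] -- state leaks across calls

def preprocess_fixed(inputs, words=None):  # the usual fix: a None sentinel
    words = [] if words is None else words
    return words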
Also, words and boxes are NOT parameters; they should come from OCRing the image (IIUC).
parameters are something that should always be possible to enable when instantiating the whole pipeline:

pipe = pipeline(...., arg=XX)
for result in pipe(dataset):
    print(result)
Here it doesn't seem like words or boxes make sense in this context. They should come from the data itself.
If it becomes an actual input -> different pipeline (question answering maybe, but the names are going to be different, and no bounding boxes can be sent either for the same reason; maybe we can implement without sending boxes? What are even boxes when there is no image?)
If it comes from the image, everything seems fine.
OMG how did I not know that! Thank you for pointing it out. I'll fix the default argument piece and rework these into optional arguments per our discussion below.
if "words" in inputs: | ||
# TODO: Can we refactor and share (or inherit from) the QuestionAnweringPipeline? I think we want to do | ||
# the same thing with the tokenizer, except we have a few extra arguments (e.g. the bounding boxes). | ||
extra_tokenizer_params = { | ||
"return_token_type_ids": True, | ||
"return_attention_mask": True, | ||
"is_split_into_words": True, | ||
# "max_length": 512, | ||
} | ||
padding = "max_length" | ||
truncation = True | ||
question = [question] | ||
text_pair = inputs["words"] |
Actually, you shouldn't really override padding/truncation by default.
padding/truncation should be user controlled, never done automatically (it has performance implications, both quality-wise and speed-wise). Pipelines should never hide such details (it's done on some pipelines but that's purely legacy). (We can provide some simple parameter arguments to work around those issues.)
Why did you choose to add these? There might be another way to do the same thing.
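For reference, a rough sketch of how padding/truncation could instead be exposed as explicit, opt-in parameters through the pipeline's _sanitize_parameters hook (the names and defaults below are assumptions, not the final implementation):

def _sanitize_parameters(self, padding=None, truncation=None, **kwargs):
    preprocess_params = {}
    if padding is not None:
        preprocess_params["padding"] = padding  # only forwarded when the user opts in
    if truncation is not None:
        preprocess_params["truncation"] = truncation
    # (preprocess, forward, postprocess) parameter dicts
    return preprocess_params, {}, {}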
padding = "max_length" | ||
truncation = True | ||
question = [question] | ||
text_pair = inputs["words"] |
is_split_into_words is really dangerous and should be avoided as much as possible. It bypasses the pre_tokenizer of the tokenizer, which means the actual resulting tokens might be very different from what the model is supposed to see. It's really only used in corner cases when there is no other way.
Let's say your OCR detects "New York" as a single box, and the tokenizer is word-level, knows only "new" and "york", and uses a space-splitting pre_tokenizer. Then using is_split_into_words means the model is going to receive unk (since it cannot correctly split on the space) instead of ["new", "york"].
If I understand correctly, you're trying to work around the fact that you cannot send a list of strings (the boxes found in the image) directly to the tokenizer (there's only text + text_pair).
Here we can probably work around that by using a single string (instead of a list) and dealing with offsets instead.
Another option would be to do the post_processor "manually" (directly using the underlying tokenizer).
Both options are not great, but they would really avoid the caveats implied by is_split_into_words, which IMO is a real danger here. (Most tokenizers have no concept of words, and depending on the OCR's actual chunking it can cause issues as shown; the first option's issue is that the offset boundaries might not land on the boxes.)
Since this is more involved, we can keep that as-is for this PR, with a comment noting the caveat, and eventually fix it later.
Maybe tokenizers can make the fix even simpler by simply accepting lists of strings instead of just one string or a pair.
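To make the caveat concrete, a tiny sketch with the tokenizers library (the word-level vocab and the OCR chunk are invented for illustration):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(WordLevel({"new": 0, "york": 1, "[UNK]": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

tok.encode("new york").tokens
# ['new', 'york'] -- the pre-tokenizer splits on whitespace before lookup

tok.encode(["new york"], is_pretokenized=True).tokens
# The OCR returned "new york" as one box; treating boxes as pre-split words
# bypasses the pre-tokenizer, so the whole chunk falls back to '[UNK]'.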
Okay yeah, I totally agree with you. The reason I pass this argument in is that the tokenizer complains if I try to pass in an (unsplit) question and a (split) text_pair of words, and for this use case we don't need to index within the question to produce an answer (accuracy/performance issues aside).
I believe the LayoutLMv3Tokenizer gets around this limitation by tokenizing the question separately from the words in this case. I appreciate you letting us keep it as-is for this PR, and then we can work on a better solution (which I'm happy to help contribute).
for i, s, w in zip(
    encoding.input_ids[batch_index],
    encoding.sequence_ids(batch_index),
    encoding.word_ids(batch_index),
Here you use the word ids basically as the box_id, and the sequence_id as a flag to know whether the token is in the context or not. Right?
If yes, that's just a confirmation that there's a problem with the current code (totally understandable, you're working around limitations), because the "words" are not really words, but chunks of text in an image.
Again, no real change needed, just emphasizing that this behavior will rely on the OCR and tokenizer working in specific ways.
Yes. You could read this code colloquially as "if the box belongs to a word in the doc (i.e. sequence id is 1) then assign it a bounding box based on its original (word id) index. If the box belongs to a separator token, assign it [1000,1000,1000,1000]. Otherwise, assign it to [0,0,0,0]".
The challenge you're raising (words vs. boxes in the image) is a very real challenge for document processing tools. Most do not support "within-box" predictions. Definitely a problem worth visiting in the future.
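For readers following along, the assignment described above looks roughly like this (a paraphrase of the snippet under review, not the final code; boxes is assumed to be the list of OCR boxes, one per word):

bbox = []
for input_id, sequence_id, word_id in zip(
    encoding.input_ids[batch_index],
    encoding.sequence_ids(batch_index),
    encoding.word_ids(batch_index),
):
    if sequence_id == 1:
        bbox.append(boxes[word_id])  # token comes from a document word: reuse its OCR box
    elif input_id == self.tokenizer.sep_token_id:
        bbox.append([1000] * 4)  # separator token
    else:
        bbox.append([0] * 4)  # question / special / padding tokens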
# TODO: This is a very poor implementation of start/end (just here for completeness sake).
# Ideally we can refactor/borrow the implementation in the question answering pipeline.
results = []
I really like the idea of reusing the qa implementation here (especially since it's really non-trivial).
The easiest thing to do should be to extract the qa code into a function and reuse it here if it's exactly the same.
I'll do that!
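For context, the span-selection core of the QA pipeline that could be extracted and shared looks roughly like this (a simplified sketch of the documented behavior; the function name is illustrative, and the real code also masks special tokens):

import numpy as np

def decode_spans(start_probs, end_probs, top_k=1, max_answer_len=15):
    # Score every (start, end) pair, keep only spans with start <= end and
    # length <= max_answer_len, then return the top_k highest-scoring spans.
    candidates = np.matmul(np.expand_dims(start_probs, -1), np.expand_dims(end_probs, 0))
    candidates = np.tril(np.triu(candidates), max_answer_len - 1)
    scores_flat = candidates.flatten()
    idx = np.argsort(-scores_flat)[:top_k]
    starts, ends = np.unravel_index(idx, candidates.shape)
    return starts, ends, candidates[starts, ends]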
def __call__(self, image: Union["Image.Image", str], question: str = None, **kwargs):
def __call__(
    self,
    image: [Optional[Union["Image.Image", str]]] = None,
This needs to be set at call time. It's visual question answering, so it does require an image.
Will adjust per our discussion below. I think what we should do is make image a required argument, and then (without making it too obvious) allow an advanced user to pass None (or another empty-like value) and handle it gracefully within the implementation.
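A rough sketch of the call-time behavior being proposed (the argument names are illustrative, not the final signature):

def __call__(self, image, question=None, word_boxes=None, **kwargs):
    # image is required in the signature; an advanced user may still pass None
    # explicitly when they supply pre-computed word/box pairs.
    if image is None and word_boxes is None:
        raise ValueError("Provide an image, or pre-computed word_boxes for advanced use")
    ...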
def __call__(
    self,
    image: [Optional[Union["Image.Image", str]]] = None,
    question: Optional[Union[List[str], str]] = None,
This None here exists solely to support pipeline({"image": image, "question": question}) (which is necessary when you want to work with datasets or lists directly).
It's not documented, kind of on purpose; the documentation body showcases the various uses, but here the strict typing just makes things less readable IMO, since if you do send it, it should be a str.
I read the documentation below and the List options for question and image are actually wrong and shouldn't be there.
Okay cool. I'll just revert this since we'll place things in a new pipeline anyway.
Feel free to do the typing in the new pipeline. The most critical piece is the docstring (and this is the one where readability > completeness)
if "logits" in model_outputs: | ||
return postprocess_sequence_output(self.model, model_outputs, self.framework, top_k) | ||
elif "start_logits" in model_outputs and "end_logits" in model_outputs: | ||
return postprocess_qa_output( |
It's OK to do it this way.
This is directly linked to the ModelForXX API to return logits or start_logits.
Okay cool, thanks for the clarification.
scores = scores.tolist()
ids = ids.tolist()
return [{"score": score, "answer": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
assert False, "Unknown output format"
Again please raise real exceptions.
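For instance, something along these lines (the message wording is illustrative, not the PR's final text):

raise ValueError(f"Unknown output format: expected logits or start_logits/end_logits, got {list(model_outputs.keys())}")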
Will fix. These were meant to be temporary as I was sketching out the code.
No worries.
@Narsil thank you for the review! Before I go in and apply the comments, I thought it might be worth discussing the required (or not)

The reason (I think) it's important to allow users to pass in words/boxes instead of an image is that users often either want to run their own OCR (e.g. using a proprietary system like Google/Microsoft) OR are extracting data from documents that have embedded text (e.g. many PDF documents, Word, Excel, etc.). Furthermore, there are a lot of pre-processing tricks that are relevant to certain OCR implementations (e.g. some try to order words by line, others by block, etc.) that have a very significant impact on BERT-inspired models like LayoutLM (because of the attention mechanism, position ids, etc.). tl;dr, having some control over words/boxes is very important if you're trying to use the pipeline in a production scenario.

Now, you could argue that if they want to do this, they could use the question answering pipeline. In fact, when I started exploring HuggingFace/transformers, I did just that! The problem is that if you join everything together (into

I could also see an argument that if a user wants this level of control, they shouldn't use a pipeline in the first place, but the implementations of QA preprocessing and postprocessing are really compelling -- which kind of drew us to really wanting to take advantage of them vs. trying to reinvent them elsewhere.

Hopefully that makes sense and adds some context for why I proposed making
Yes, very much so! I will try to convey this part of the library's perspective, but having your view is great too, since I probably know less about OCRs and overall document processing than you.
This is very interesting to know! The fact that the OCR has that much impact on the quality of the results doesn't surprise me (and the
Yes, I felt the same thing when reading your code and pondering whether it should actually belong or not in
Very reasonable! :)
Ok, I am going to recap the main goal of the pipeline:
That's the first goal, and in doing that, we don't want to hide any ML details that might hurt users unknowingly (like chunking things in a way that can hurt output quality without being opt-in). So hide as many things as possible when the defaults are correct, but don't hide any magic that could be use-case dependent. For instance, truncating without asking for explicit user consent (via a parameter) means the user will try to send large chunks of text and get an output that corresponds only to a tiny chunk of it, without realizing it.

Another, secondary goal is to make them as reusable/extendable as possible, but only when it doesn't contradict any of the previous goals.

With that in mind, you see why having inputs/outputs that depend on the actual model type forces non-ML practitioners to know about model types, when the goal is to try and lift that burden. If we can ensure that sending the same input will receive the same output, it means users can jump very easily between models. So when AwesomeModelA comes out, you can just swap its name and make it work. Same goes for iterations/fine-tuning of the same model, or different models, and so on.

Here I can see, I think, two solutions:

1/ We create a new pipeline (

2/ We keep using

pipe = pipeline("mymodel-id", ocr=MyOCR())

class MyOCR:
    def forward(self, image):
        ...
        return texts, boxes

What do you think? Which solution makes the most sense from your perspective? Also, regardless of the choice here, we can extract whatever makes sense as an individual function within
For Option 1, to clarify, would you be open to allowing the user to pass in (optional) words and boxes? I think this is conceptually similar to your point about audio pipelines using ffmpeg, but I may be misunderstanding something. Essentially, we'd run OCR on the image if the words/boxes are not passed in. And either way, depending on the model, pass the image into the model as well. If we made the words/boxes an optional input, then users could basically assert control where they'd like to, but the simple use cases would just work out of the box.

Personally, I think option 1 is the way to go. I can at least sketch out the code as a next step and we can take a look and reevaluate.
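In code, the fallback described in this proposal could look something like the following (a sketch; run_default_ocr is a hypothetical helper, e.g. Tesseract-based):

if words is None or boxes is None:
    # Only pay the OCR cost when the caller didn't bring their own words/boxes.
    words, boxes = run_default_ocr(image)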
I have the same uneasiness with any optional inputs. Either the pipeline needs the data or it doesn't. IMO the incoming data should be as strongly typed as possible, and the computation definitely should not depend on what the user actually sent (because then it becomes really hard to reason about what actually happened on a piece of data: which OCR was used? Were the boxes correct? etc.). I feel like I am missing a piece of the puzzle here, so maybe we can go the other way around: let's try to devise what we would actually like to write as a user for this document processing. IMO the simplest is something like:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Or maybe be a little more strict:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
# ValueError: This model is a document processing model, and requires an OCR to be able to use this pipeline,
# please pass an OCR. For demos, you can use `from transformers.pipelines import DefaultOCR`
pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", ocr=DefaultOCR())
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}, ...]
Ahh, okay, yes I agree that working from these examples is really helpful. Let me first be precise about what is required vs. not:
I think part of why it's structured this way, or at least one of the advantages, is that in practice, since OCR can be costly (time and $$), many document processing practitioners will run OCR as a pre-processing step, so you can reuse its results across many invocations of extractions/questions/etc. E.g. imagine you were building an app that lets you point at a folder on your computer and then ask the files questions interactively. You'd probably implement this app by first running OCR on each file, and then re-using the OCR each time a user provides a new question as input.

With this in mind, I think there are probably a few different use cases that would be ideal to capture in the pipeline. I fully recognize that some of these may qualify as "more advanced" than the scope of a pipeline, so I'm open to and appreciate your push back on where that may be the case.

Scenario 1: Single file, single question (your example above)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Scenario 2: Interactive REPL (this is an approximation of a real-world use case)

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx")
img = Image.load("id_card.jpg")
words, boxes = my_favorite_ocr(img)
while True:
    question = input("Ask a question of the image: ")
    print(pipe(image=img, question=question, words=words, boxes=boxes))

Scenario 3: Mixed Media Types

img = rasterize("my_tax_form.pdf")
words, boxes = text_extract("my_tax_form.pdf")
# NOTE: in certain models, e.g. LayoutLMv1, you do not even need to rasterize/pass in the image as input in this case
out = pipe(image=img, question="What is the person's income?", words=words, boxes=boxes)
# out == [{answer: "$10", score:0.8}, {"answer": "$1000", "score": 0.1}]

I can certainly imagine some alternatives:

Let me know your thoughts. In certain options (namely the first), I think it'd be a requirement for it to be a
Ok, thanks for all the explanation! Now I think I am starting to understand it, and all the use cases you displayed really make sense!

I think we can ignore layoutlmv1 not requiring the image so we can keep the number of cases rather small. (If you really know what you're doing you could always send an empty image, or we could just make the code in such a way that sending

Since the OCR is indeed quite costly (or can come from a non-image!) I can really understand why we would need those optional

I think

And if a user sends

Actually, wdyt about having a list of tuples instead of two lists? Two lists enable having different-sized lists, which would silently break things; I usually tend to prefer arguments that cannot by design be inconsistent, and lists of tuples cannot have different sizes and will necessarily raise errors when the tuple is unwrapped, so less room for error.

I think all 3 examples could become tests so that we make sure that those cases are maintained over time.

I will ping @NielsRogge, who is also really involved in vision and might have other insights.
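A tiny illustration of the list-of-tuples point (the words/boxes are made up):

word_boxes = [("24", [88, 40, 120, 60]), ("nerdy", [124, 40, 180, 60])]
# Two parallel lists of different lengths would be silently truncated by zip(),
# whereas unpacking a list of (word, box) pairs fails loudly on any malformed entry:
for word, box in word_boxes:  # ValueError if an entry is not a 2-tuple
    print(word, box)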
Awesome, I really appreciate you taking the time to dig into this with me. I'll sketch this all out as a next step. And I agree that we can leave the empty image (or None image) as a performance optimization for advanced users.

The one thing we'll need to be careful of is that the LayoutLMv1 model gets upset if you do pass in the image (i.e. it's optional for v2/v3 but not for v1 -- v1 does not accept images at all). So if the workaround is to pass in an empty image, we'll just need to figure out a way to cleverly avoid handing it to the model (e.g. implement a no-op feature extractor that takes the image as input and returns an empty dict).

With all of this context in mind, do you have a preference for whether we extend the existing
I have no concerns with this. The runtime "perf hit" of converting one format to the other is trivial compared to the other operations involved. I think it's a smart way to prevent an accidental length mismatch.
Great point. I'm happy to contribute these.
Go with

But you made great points explaining the core differences (especially the pdf example, for instance), IMO.
Okay great, as a next step I'll rework this PR to sketch out
I just pushed an update that moves the logic into a new
I did some sanity testing with a model we've trained and can confirm that it is starting to work! I think we're headed in the right direction.
Not sure I understand; in the pipelines, the padding should be done by the pipeline itself, not by the
If I'm understanding the QA pipeline code correctly, the reason padding is relevant is that if you stride a document (e.g. one with more than 512 words), then one item that you preprocess might result in multiple inputs to the model that get concatenated together in one big tensor. The question answering pipeline has to solve for this too, and it seems to do that by (a) not returning tensors from

I'm happy to do it that way if you prefer -- my working assumption was that the "padding" argument to the tokenizer accomplishes the same thing (but I certainly may be missing some interesting implementation detail).
Ok, this is what I alluded to. QA solves this by using

As a first step, I would recommend simply not treating padding, and letting sequences that are too long go to the model, which will then crash. It's aligned with the "don't hide anything" policy. Some models can handle long range, most cannot, so not trying to hide that fact is IMO a good thing. We can add auto chunking in a follow-up PR.
That plan works for me! I'll provide a new update shortly.
@Narsil I just updated the PR with a few changes that remove the padding/striding stuff (for now) and add some docs. The next steps are to integrate OCR and then refactor/borrow the post-processing code from the QA pipeline. I'll keep working on that but wanted to post an intermediate update in case you had a chance to take a quick look.
@Narsil another thought / question I had while working on the OCR stuff... Currently, both LayoutLMv2 and v3 have a feature extractor which by default applies OCR. By incorporating OCR into the pipeline itself (which I'm doing by just borrowing their code), we essentially take over that functionality. So, a user may have to do something like this:

pipe = pipeline(task="visual-question-answering", model="layoutlmv3-xxx", tokenizer="layoutlmv3-xxx", feature_extractor=AutoFeatureExtractor.from_pretrained("layoutlmv3-xxx", apply_ocr=False))
out = pipe(image=Image.load("id_card.jpg"), question="What is this person's address ?")
# out == [{answer: "24 nerdy street", score:0.8}, {"answer": "John travolta", "score": 0.1}]

Essentially, we'll want to rely on the pipeline's OCR, not the feature extractor's. However, as a result, we make the user experience a bit awkward (since they have to provide "apply_ocr"

Let me know your thoughts.
@Narsil I think I've implemented all of what we talked about (and apologies in advance if I missed anything). To summarize:
As a next step, I'd appreciate a quick review from you on these major points to verify whether we're on the right track. I'd like to add the tests and more documentation next (pending your feedback on whether we are in a good place with the interface/overall design). I also have a few questions regarding the tests/docs:
@Narsil gentle nudge in case this slipped from your queue :)
So rather than extending the VQA pipeline, it seems that the design has been updated to create a separate DocumentQuestionAnswering pipeline?
Also, I'd like to note that there's a new model I'm working on called Donut which solves DocVQA in a generative manner. Donut is a generative T5-like model, which simply generates the answer given a question. Would this pipeline be able to support that model as well?
Yes that's correct.
The interface should support it. As input, you provide an image+question (and optional word/box pairs if you've pre-run OCR), and as output you receive an answer + start/end words. For a generative model, I could imagine either omitting the start/end or the pipeline doing its best to find it in the document if it exists. Code-wise, there may be some refactoring within the pipeline implementation to best support a model like Donut. Very happy to collaborate with you on that.
@NielsRogge @Narsil @sgugger @patrickvonplaten I think we are pretty much there -- I just updated to address the outstanding comments, and the failing tests look like they are flaky (unrelated to this change). Let me know if there's anything else needed on this PR.
Thanks for adding this new pipeline. Left a couple of small comments and then it should be good to merge.
There are a lot of follow-ups/TODOs in this PR; it might be worth opening an issue to not lose track of them.
"default": { | ||
"model": { | ||
"pt": ("impira/layoutlm-document-qa", "3a93017") | ||
}, # TODO Update with custom pipeline removed, just before we land |
Is this still relevant?
No, good catch, removed.
def preprocess(
    self,
    input,
    lang=None,
    tesseract_config="",
):
def preprocess(
    self,
    input,
    lang=None,
    tesseract_config="",
):
def preprocess(self, input, lang=None, tesseract_config=""):
Nit: fits in one line.
Fixed (seems like Black was keeping it on separate lines because of the trailing comma)
)

if isinstance(self.model, VisionEncoderDecoderModel):
    task_prompt = f'<s_docvqa><s_question>{input["question"]}</s_question><s_answer>'
Can we refine the test to be more specific to Donut then? Vision encoder decoder is a wide class.
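One way to narrow the check, sketched under the assumption that Donut's encoder exposes a distinctive model_type (the attribute and value below are assumptions, not necessarily the final fix):

if isinstance(self.model, VisionEncoderDecoderModel):
    if getattr(self.model.config.encoder, "model_type", None) != "donut-swin":
        raise ValueError("Only Donut-style VisionEncoderDecoder models are supported by this pipeline")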
if "boxes" not in tokenizer_kwargs: | ||
bbox = [] | ||
for batch_index in range(num_spans): | ||
for i, s, w in zip( |
Can we use better names here?
Fixed
return {
    **encoding,
    "p_mask": p_mask,
    "word_ids": word_ids,
    "words": words,
}
Fits in one line.
Fixed
def postprocess_encoder_decoder(self, model_outputs, **kwargs):
    # postprocess
    sequence = self.tokenizer.batch_decode(model_outputs.sequences)[0]
    sequence = sequence.replace(self.tokenizer.eos_token, "").replace(self.tokenizer.pad_token, "")
+1
)

word_ids = model_outputs["word_ids"][0]
for s, e, score in zip(starts, ends, scores):
Can we use better names for s (start) and e (end) here?
Yes fixed.
Thanks @sgugger. Should I open one issue with all of them, or a separate issue for each TODO? Happy to do that.
I think one issue with the list of TODOs is fine.
Here is the task with the follow-ups: #18926.
Thanks again for your contribution!
* [WIP] Skeleton of VisualQuestionAnweringPipeline extended to support LayoutLM-like models
* Fixup
* Use the full encoding
* Basic refactoring to DocumentQuestionAnsweringPipeline
* Cleanup
* Improve args, docs, and implement preprocessing
* Integrate OCR
* Refactor question_answering pipeline
* Use refactored QA code in the document qa pipeline
* Fix tests
* Some small cleanups
* Use a string type annotation for Image.Image
* Update encoding with image features
* Wire through the basic docs
* Handle invalid response
* Handle empty word_boxes properly
* Docstring fix
* Integrate Donut model
* Fixup
* Incorporate comments
* Address comments
* Initial incorporation of tests
* Address Comments
* Change assert to ValueError
* Comments
* Wrap `score` in float to make it JSON serializable
* Incorporate AutoModeLForDocumentQuestionAnswering changes
* Fixup
* Rename postprocess function
* Fix auto import
* Applying comments
* Improve docs
* Remove extra assets and add copyright
* Address comments

Co-authored-by: Ankur Goyal <ankur@impira.com>
Can you please help me? I tried a lot, but it only gives the 1st value.
What does this PR do?
This PR extends VisualQuestionAnsweringPipeline to accept words and boxes as input, passes them into the tokenizer/model (along with the question), and post-processes their QuestionAnsweringModelOutput response.

Fixes #18380
Who can review?
@Narsil