DocVQA input input_ids at training time #312

Open
cccccckt opened this issue Sep 2, 2024 · 1 comment
cccccckt commented Sep 2, 2024

I don't know whether there is a problem with my data processing or whether the metadata.jsonl file was created incorrectly, but I found that the input_ids fed to the Donut model contain the answer part. Is this normal?
You can see the following input_ids:

tensor([[57527, 57529, 11604, 52743, 48941, 45383, 18528, 43095, 36477, 46385, 35647, 36209, 57524, 57526, 46481, 23485, 35815, 4768, 57523, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:2')

I used the tokenizer to decode it and got this result:
<s_docvqa> <s_question> ▁When ▁is ▁the ▁response ▁code ▁request ▁form ▁dat ed ? </s_question> <s_answer> ▁September ▁10 , ▁1996 </s_answer> </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
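For reference, roughly how that decode can be reproduced (the `donut_model.decoder.tokenizer` attribute path follows the util.py snippet quoted below; variable names are illustrative):

```python
# Illustrative decode of the collated input_ids with the Donut decoder's tokenizer.
tokens = donut_model.decoder.tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(" ".join(tokens))
# -> <s_docvqa> <s_question> ▁When ... <s_answer> ▁September ▁10 , ▁1996 </s_answer> </s> <pad> ...
```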

From my experience, and based on the code provided by the author, the prompt should look like this:
<s_docvqa> <s_question><question></s_question> <s_answer>

Part of my metadata.jsonl file is as follows:
{"file_name": "sxxj0037_2.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"How many points are there in modifications to readout instrumentation\", \"answer\": \"5.\"}]}"}
{"file_name": "tynx0037_1.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"What is the first line of the address mentioned at the top?\", \"answer\": \"Reynolds Building\"}, {\"question\": \"What is the date mentioned?\", \"answer\": \"May 4, 2000\"}]}"}
{"file_name": "mtyj0226_1.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"What is the word written in bold black in the first picture?\", \"answer\": \"Coke\"}]}"}


cccccckt commented Sep 3, 2024

This is the input_ids and labels generation code in the util.py file, and I can see that input_ids includes the answer during the training phase:
```python
if self.split == "train":
    labels = input_ids.clone()
    labels[
        labels == self.donut_model.decoder.tokenizer.pad_token_id
    ] = self.ignore_id  # model doesn't need to predict pad token
    labels[
        : torch.nonzero(labels == self.prompt_end_token_id).sum() + 1
    ] = self.ignore_id  # model doesn't need to predict prompt (for VQA)
    return input_tensor, input_ids, labels
else:
    prompt_end_index = torch.nonzero(
        input_ids == self.prompt_end_token_id
    ).sum()  # return prompt end index instead of target output labels
    return input_tensor, input_ids, prompt_end_index, processed_parse
```

In other words, the input during training needs to include the answer, but it is not needed at inference.
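
That matches standard teacher forcing: during training the decoder receives the full sequence (prompt + answer) as input, and the loss is simply masked so that the prompt and padding positions are never predicted; at inference only the prompt is fed and the answer tokens are generated. A minimal sketch, using a few abridged token ids from the example above:

```python
# Sketch of the label masking, with abridged token ids from the example above:
# 57527=<s_docvqa>, 57529=<s_question>, 57524=</s_question>, 57526=<s_answer>,
# 57523=</s_answer>, 2=</s>, 1=<pad>.
import torch

pad_token_id = 1
prompt_end_token_id = 57526   # <s_answer> is the prompt end token for DocVQA
ignore_id = -100              # ignored by the cross-entropy loss

input_ids = torch.tensor([57527, 57529, 11604, 57524, 57526, 46481, 4768, 57523, 2, 1, 1])

labels = input_ids.clone()
labels[labels == pad_token_id] = ignore_id                       # never predict <pad>
prompt_end = torch.nonzero(labels == prompt_end_token_id).sum()  # position of <s_answer>
labels[: prompt_end + 1] = ignore_id                             # never predict the prompt
print(labels)
# tensor([-100, -100, -100, -100, -100, 46481, 4768, 57523, 2, -100, -100])

# At inference, only the prompt (up to and including <s_answer>) is passed to the
# decoder and the answer tokens are generated autoregressively.
```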
