DocVQA input input_ids at training time #312

Open
cccccckt opened this issue Sep 2, 2024 · 1 comment
cccccckt commented Sep 2, 2024

I don't know whether there is a problem with my data processing or whether the metadata.jsonl file was created incorrectly, but I found that the input_ids fed to the Donut model contain the answer part. Is this normal?
You can see the following input_ids:

tensor([[57527, 57529, 11604, 52743, 48941, 45383, 18528, 43095, 36477, 46385, 35647, 36209, 57524, 57526, 46481, 23485, 35815, 4768, 57523, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:2')

I used the tokenizer to decode it and got this result:
<s_docvqa> <s_question> ▁When ▁is ▁the ▁response ▁code ▁request ▁form ▁dat ed ? </s_question> <s_answer> ▁September ▁10 , ▁1996 </s_answer> </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
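For reference, roughly how that decode can be reproduced (the `donut_model.decoder.tokenizer` attribute path follows the util.py snippet quoted below; variable names are illustrative):

```python
# Illustrative decode of the collated input_ids with the Donut decoder's tokenizer.
tokens = donut_model.decoder.tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(" ".join(tokens))
# -> <s_docvqa> <s_question> ▁When ... <s_answer> ▁September ▁10 , ▁1996 </s_answer> </s> <pad> ...
```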

From my experience, and based on the code provided by the author, the prompt should look like this:
<s_docvqa> <s_question><question></s_question> <s_answer>

Part of my metadata.jsonl file is as follows:
{"file_name": "sxxj0037_2.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"How many points are there in modifications to readout instrumentation\", \"answer\": \"5.\"}]}"}
{"file_name": "tynx0037_1.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"What is the first line of the address mentioned at the top?\", \"answer\": \"Reynolds Building\"}, {\"question\": \"What is the date mentioned?\", \"answer\": \"May 4, 2000\"}]}"}
{"file_name": "mtyj0226_1.png", "ground_truth": "{\"gt_parses\": [{\"question\": \"What is the word written in bold black in the first picture?\", \"answer\": \"Coke\"}]}"}


cccccckt commented Sep 3, 2024

This is the input_ids and labels generation code in the util.py file, and I can see that input_ids includes the answer during the training phase:
```python
if self.split == "train":
    labels = input_ids.clone()
    labels[
        labels == self.donut_model.decoder.tokenizer.pad_token_id
    ] = self.ignore_id  # model doesn't need to predict pad token
    labels[
        : torch.nonzero(labels == self.prompt_end_token_id).sum() + 1
    ] = self.ignore_id  # model doesn't need to predict prompt (for VQA)
    return input_tensor, input_ids, labels
else:
    prompt_end_index = torch.nonzero(
        input_ids == self.prompt_end_token_id
    ).sum()  # return prompt end index instead of target output labels
    return input_tensor, input_ids, prompt_end_index, processed_parse
```

In other words, the input during training needs to include the answer, but it is not needed at inference.
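
That matches standard teacher forcing: during training the decoder receives the full sequence (prompt + answer) as input, and the loss is simply masked so that the prompt and padding positions are never predicted; at inference only the prompt is fed and the answer tokens are generated. A minimal sketch, using a few abridged token ids from the example above:

```python
# Sketch of the label masking, with abridged token ids from the example above:
# 57527=<s_docvqa>, 57529=<s_question>, 57524=</s_question>, 57526=<s_answer>,
# 57523=</s_answer>, 2=</s>, 1=<pad>.
import torch

pad_token_id = 1
prompt_end_token_id = 57526   # <s_answer> is the prompt end token for DocVQA
ignore_id = -100              # ignored by the cross-entropy loss

input_ids = torch.tensor([57527, 57529, 11604, 57524, 57526, 46481, 4768, 57523, 2, 1, 1])

labels = input_ids.clone()
labels[labels == pad_token_id] = ignore_id                       # never predict <pad>
prompt_end = torch.nonzero(labels == prompt_end_token_id).sum()  # position of <s_answer>
labels[: prompt_end + 1] = ignore_id                             # never predict the prompt
print(labels)
# tensor([-100, -100, -100, -100, -100, 46481, 4768, 57523, 2, -100, -100])

# At inference, only the prompt (up to and including <s_answer>) is passed to the
# decoder and the answer tokens are generated autoregressively.
```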
