I would like to train the base model for a few more epochs on the pre-training pseudo-OCR task using a custom dataset. In what reading order should the individual words of the document image be passed to the model? The Donut paper states:
The model is trained to read all texts in the image in reading order (from top-left to bottom-right, basically). [...] This task can be interpreted as a pseudo-OCR task.
What does "top-left to bottom-right" mean for multi-column text? For instance, consider the attached dummy document with one heading and two text columns:
Should the document be transcribed as:
Word1 Col1w1 Col1w2 Col2w1 Col2w2, or
Word1 Col1w1 Col2w1 Col1w2 Col2w2 ?
I imagine that any dataset used for the pre-training pseudo-OCR task should adopt the same reading order policy as the pre-trained Donut base model. Unfortunately, I cannot find any information on the exact implementation of "top-left to bottom-right" in the paper, the paper supplement, or the source code. After all, "top-left to bottom-right" can be interpreted in different ways:
top-to-bottom, left-to-right
left-to-right, top-to-bottom
clustering of words into text blocks to mimic semantically meaningful text paragraphs
etc.
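To make the ambiguity concrete, here is a minimal sketch of the first two interpretations applied to the dummy document. Everything in it is an assumption for illustration: the `Word` type, the coordinates, and the function names `row_major` and `column_major` are hypothetical, and I am reading each interpretation as "sort by one axis first, then the other". It is not taken from the Donut code base and does not claim to reproduce how the pre-training data was ordered.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word record: text plus the top-left corner of its box."""
    text: str
    x: float  # distance from the left page edge
    y: float  # distance from the top page edge


# Made-up coordinates for the dummy layout: one heading, two columns below it.
words = [
    Word("Col2w2", 300, 200),
    Word("Col1w1", 50, 100),
    Word("Word1", 50, 20),
    Word("Col2w1", 300, 100),
    Word("Col1w2", 50, 200),
]


def row_major(words):
    """Order by vertical position first, then left to right within a row."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_major(words):
    """Order by horizontal position first, then top to bottom within a column."""
    return [w.text for w in sorted(words, key=lambda w: (w.x, w.y))]


print(" ".join(row_major(words)))     # Word1 Col1w1 Col2w1 Col1w2 Col2w2
print(" ".join(column_major(words)))  # Word1 Col1w1 Col1w2 Col2w1 Col2w2
```

The two sorts reproduce the two candidate transcriptions above. Note that the column-first sort only places the heading first here because the heading happens to share its left edge with the first column; a centered heading would end up between the columns. The block-clustering interpretation needs some form of layout analysis and is not sketched here.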