ocrx_line is missing from the ocr-capabilities metadata field #94
Should this be ocrodjvu uses Tesseract uses |
Currently, pdftotree does not properly handle multi-column documents. The 2nd page of https://github.com/HazyResearch/pdftotree/blob/v0.5.0/tests/input/paleo.pdf looks like this: The corresponding part of the pdftotree output looks like this:
As can be seen above, the result is not in reading order; hence pdftotree should use |
Describe the bug
An output hOCR contains
<meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>
, which is missing ocrx_line.
To Reproduce
Steps to reproduce the behavior:
$ pdftotree some.pdf -o some.hocr
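To confirm the problem in the generated file, a check along the following lines could be used. This is only a minimal sketch using the Python standard library; `CapabilityParser` and `ocr_capabilities` are hypothetical helper names for illustration and are not part of pdftotree. The sample string is the meta tag quoted in this report.

```python
from html.parser import HTMLParser


class CapabilityParser(HTMLParser):
    """Collects the values listed in the hOCR ocr-capabilities meta tag."""

    def __init__(self):
        super().__init__()
        self.capabilities = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> are also routed here by default.
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == "ocr-capabilities":
                self.capabilities = (attr.get("content") or "").split()


def ocr_capabilities(hocr_text):
    """Return the declared capabilities as a list of strings (hypothetical helper)."""
    parser = CapabilityParser()
    parser.feed(hocr_text)
    return parser.capabilities


# Meta tag as quoted in this report.
sample = '<meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>'
caps = ocr_capabilities(sample)
print("ocrx_line" in caps)
```

For a real output file, the contents of `some.hocr` would be read and passed to `ocr_capabilities` in place of `sample`.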
Expected behavior
The metadata field should contain
ocrx_line.
Error Logs/Screenshots
If applicable, add error logs or screenshots to help explain your problem.
Environment (please complete the following information):
pdftotree Version: v0.5.0
Additional context
Add any other context about the problem here.