ocrx_line is missing from the ocr-capabilities metadata field #94

Closed
HiromuHota opened this issue Oct 20, 2020 · 2 comments · Fixed by #95

@HiromuHota
Contributor

Describe the bug

The output hOCR contains <meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>, which omits ocrx_line even though the document body uses that class.

To Reproduce
Steps to reproduce the behavior:

  1. $ pdftotree some.pdf -o some.hocr

Expected behavior

The metadata field should contain ocrx_line.
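
A minimal sketch of how this mismatch could be checked automatically. It is not part of pdftotree; the names HocrCapabilityChecker and missing_capabilities are hypothetical, and the sample is a trimmed hOCR fragment modeled on the output in this issue (the enclosing ocr_page div is an assumption). It uses only the Python standard library to compare the classes declared in ocr-capabilities with the ocr_*/ocrx_* classes that actually appear in the document.

```python
from html.parser import HTMLParser


class HocrCapabilityChecker(HTMLParser):
    """Collect the declared ocr-capabilities and the ocr_*/ocrx_* classes in use."""

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also reach this method,
        # because the default handle_startendtag delegates to it.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "ocr-capabilities":
            self.declared.update((attrs.get("content") or "").split())
        for cls in (attrs.get("class") or "").split():
            if cls.startswith(("ocr_", "ocrx_")):
                self.used.add(cls)


def missing_capabilities(hocr_text):
    """Return the classes used in the body but not declared in the metadata."""
    checker = HocrCapabilityChecker()
    checker.feed(hocr_text)
    return sorted(checker.used - checker.declared)


# Trimmed sample modeled on the output quoted in this issue (assumed structure).
sample = """<html><head>
<meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>
</head><body>
<div class="ocr_page">
  <div class="ocrx_block" title="bbox 32 68 92 76">
    <span class="ocrx_line" title="bbox 32 68 92 76">
      <span class="ocrx_word" title="bbox 32 68 39 76">1.</span>
    </span>
  </div>
</div>
</body></html>"""

print(missing_capabilities(sample))  # ['ocrx_line']
```

Running the same function on a real some.hocr file would need only reading the file contents into a string first.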

Environment (please complete the following information):

  • pdftotree Version: v0.5.0

HiromuHota self-assigned this Oct 20, 2020
HiromuHota added the bug label Oct 20, 2020
@HiromuHota
Contributor Author

Should this be ocr_line instead of ocrx_line?
kba/hocr-spec#19

ocrodjvu uses <span class="ocrx_line"/> to represent TEXT_ZONE_LINE in djvu.
https://github.com/jwilk/ocrodjvu/blob/0.12/lib/hocr.py

Tesseract uses <span class="ocr_line"/>.
https://github.com/tesseract-ocr/tesseract/blob/4.1.1/src/api/hocrrenderer.cpp#L398-L399

@HiromuHota
Contributor Author

Currently, pdftotree does not properly handle multi-column documents.

The 2nd page of https://github.com/HazyResearch/pdftotree/blob/v0.5.0/tests/input/paleo.pdf looks like this:

[image not preserved]

The corresponding part of pdftotree output looks like this:

      <div class="ocrx_block" pdftotree="section_header" title="bbox 32 68 92 76">
        <span class="ocrx_line" title="bbox 32 68 92 76">
          <span class="ocrx_word" title="bbox 32 68 39 76">1.</span>
          <span class="ocrx_word" title="bbox 42 68 92 76">Introduction</span>
        </span>
      </div>
      <div class="ocrx_block" pdftotree="paragraph" title="bbox 301 67 552 148">
        <span class="ocrx_line" title="bbox 301 67 552 75">
          <span class="ocrx_word" title="bbox 301 67 334 75">different</span>
          <span class="ocrx_word" title="bbox 336 67 362 75">phases</span>
          <span class="ocrx_word" title="bbox 364 67 371 75">in</span>
          <span class="ocrx_word" title="bbox 373 67 391 75">their</span>
          <span class="ocrx_word" title="bbox 393 67 424 75">detailed</span>
          <span class="ocrx_word" title="bbox 426 67 500 75">microthermometric</span>
          <span class="ocrx_word" title="bbox 502 67 523 75">study</span>
          <span class="ocrx_word" title="bbox 525 67 533 75">of</span>
          <span class="ocrx_word" title="bbox 535 67 552 75">fluid</span>
        </span>
        <span class="ocrx_line" title="bbox 301 77 552 85">
          <span class="ocrx_word" title="bbox 301 77 339 85">inclusions</span>
          <span class="ocrx_word" title="bbox 341 77 349 85">in</span>
          <span class="ocrx_word" title="bbox 351 77 378 85">gangue</span>
          <span class="ocrx_word" title="bbox 380 77 413 85">minerals</span>
          <span class="ocrx_word" title="bbox 416 77 434 85">from</span>
          <span class="ocrx_word" title="bbox 436 77 440 85">a</span>
          <span class="ocrx_word" title="bbox 442 77 489 85">mineralized,</span>
          <span class="ocrx_word" title="bbox 492 77 516 85">lateral</span>
          <span class="ocrx_word" title="bbox 518 77 552 85">secretion</span>
        </span>

As can be seen above, the result is not in reading order: the paragraph starting at x=301 directly follows the section header at x=32. Hence pdftotree should use ocrx_line for now.
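
To make the reading-order problem concrete, here is a rough sketch, assuming a simple two-column page. It is not how pdftotree orders its output. The helpers parse_bbox and column_first_order are hypothetical, the split position split_x=300 is a guess for this page, and the left-column paragraph bbox is invented purely for illustration; only the first two title strings come from the output above.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in title: {title!r}")
    return tuple(int(group) for group in match.groups())


def column_first_order(bboxes, split_x):
    """Sort boxes left column first, then right column, top to bottom in each."""

    def key(bbox):
        x0, y0, x1, _ = bbox
        column = 0 if (x0 + x1) / 2 < split_x else 1
        return (column, y0, x0)

    return sorted(bboxes, key=key)


header = parse_bbox("bbox 32 68 92 76")             # section_header from the output above
right_paragraph = parse_bbox("bbox 301 67 552 148") # paragraph from the output above
left_paragraph = (32, 80, 283, 148)                 # hypothetical left-column paragraph

# Order as emitted: header, then the right-column paragraph.
emitted = [header, right_paragraph, left_paragraph]

# Column-first order keeps the left-column paragraph directly under its header.
print(column_first_order(emitted, split_x=300))
```

A fixed split position only works for clean two-column layouts; pages with full-width figures or more columns would need a real layout analysis.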
