ocrx_line is missing from the ocr-capabilities metadata field #94

HiromuHota · 2020-10-20T16:27:10Z

Describe the bug

The output hOCR contains <meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>, which omits ocrx_line even though the document body uses it.

To Reproduce
Steps to reproduce the behavior:

$ pdftotree some.pdf -o some.hocr

Expected behavior

The metadata field should contain ocrx_line.
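One way to check for this kind of mismatch is to compare the classes declared in ocr-capabilities with the ocr_*/ocrx_* classes that actually appear in the document. The following is a minimal sketch using only the Python standard library; the names HocrClassCollector and undeclared_classes and the sample string are hypothetical and are not part of pdftotree.

```python
from html.parser import HTMLParser


class HocrClassCollector(HTMLParser):
    """Collect declared ocr-capabilities and the ocr_*/ocrx_* classes in use.

    Hypothetical helper for illustration; not part of pdftotree.
    """

    def __init__(self):
        super().__init__()
        self.declared = set()
        self.used = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Capabilities are declared in <meta name="ocr-capabilities" content="...">.
        if tag == "meta" and attrs.get("name") == "ocr-capabilities":
            self.declared.update((attrs.get("content") or "").split())
        # Record every hOCR-style class used on any element.
        for cls in (attrs.get("class") or "").split():
            if cls.startswith(("ocr_", "ocrx_")):
                self.used.add(cls)


def undeclared_classes(hocr_text):
    """Return the hOCR classes that are used but not declared, sorted."""
    collector = HocrClassCollector()
    collector.feed(hocr_text)
    return sorted(collector.used - collector.declared)


# Made-up sample that mirrors the metadata reported in this issue.
sample = """<html><head>
<meta content="ocr_page ocr_table ocrx_block ocrx_word" name="ocr-capabilities"/>
</head><body>
<div class="ocr_page"><div class="ocrx_block"><span class="ocrx_line">
<span class="ocrx_word">Introduction</span></span></div></div>
</body></html>"""

print(undeclared_classes(sample))  # prints ['ocrx_line']
```

To run the same check on a real file, read some.hocr as text and pass the contents to undeclared_classes.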

Environment (please complete the following information):

pdftotree Version: v0.5.0

HiromuHota · 2020-10-20T16:44:47Z

Should this be ocr_line instead of ocrx_line?
kba/hocr-spec#19

ocrodjvu uses <span class="ocrx_line"/> to represent TEXT_ZONE_LINE in djvu.
https://github.com/jwilk/ocrodjvu/blob/0.12/lib/hocr.py

Tesseract uses <span class="ocr_line"/>.
https://github.com/tesseract-ocr/tesseract/blob/4.1.1/src/api/hocrrenderer.cpp#L398-L399

HiromuHota · 2020-10-20T17:13:32Z

Currently, pdftotree does not properly handle multi-column documents.

For the 2nd page of https://github.com/HazyResearch/pdftotree/blob/v0.5.0/tests/input/paleo.pdf, the corresponding part of the pdftotree output looks like this:

      <div class="ocrx_block" pdftotree="section_header" title="bbox 32 68 92 76">
        <span class="ocrx_line" title="bbox 32 68 92 76">
          <span class="ocrx_word" title="bbox 32 68 39 76">1.</span>
          <span class="ocrx_word" title="bbox 42 68 92 76">Introduction</span>
        </span>
      </div>
      <div class="ocrx_block" pdftotree="paragraph" title="bbox 301 67 552 148">
        <span class="ocrx_line" title="bbox 301 67 552 75">
          <span class="ocrx_word" title="bbox 301 67 334 75">different</span>
          <span class="ocrx_word" title="bbox 336 67 362 75">phases</span>
          <span class="ocrx_word" title="bbox 364 67 371 75">in</span>
          <span class="ocrx_word" title="bbox 373 67 391 75">their</span>
          <span class="ocrx_word" title="bbox 393 67 424 75">detailed</span>
          <span class="ocrx_word" title="bbox 426 67 500 75">microthermometric</span>
          <span class="ocrx_word" title="bbox 502 67 523 75">study</span>
          <span class="ocrx_word" title="bbox 525 67 533 75">of</span>
          <span class="ocrx_word" title="bbox 535 67 552 75">ﬂuid</span>
        </span>
        <span class="ocrx_line" title="bbox 301 77 552 85">
          <span class="ocrx_word" title="bbox 301 77 339 85">inclusions</span>
          <span class="ocrx_word" title="bbox 341 77 349 85">in</span>
          <span class="ocrx_word" title="bbox 351 77 378 85">gangue</span>
          <span class="ocrx_word" title="bbox 380 77 413 85">minerals</span>
          <span class="ocrx_word" title="bbox 416 77 434 85">from</span>
          <span class="ocrx_word" title="bbox 436 77 440 85">a</span>
          <span class="ocrx_word" title="bbox 442 77 489 85">mineralized,</span>
          <span class="ocrx_word" title="bbox 492 77 516 85">lateral</span>
          <span class="ocrx_word" title="bbox 518 77 552 85">secretion</span>
        </span>

As can be seen above, the output is not in reading order; hence pdftotree should keep using ocrx_line rather than ocr_line for now.
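To inspect the block order more systematically, the bounding boxes of the ocrx_block elements can be listed in document order. The following is a minimal sketch using only the Python standard library; BlockBoxCollector and block_boxes are hypothetical names, and the fragment reuses just the two block tags from the excerpt above with their contents omitted.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class BlockBoxCollector(HTMLParser):
    """Collect bounding boxes of ocrx_block elements in document order.

    Hypothetical helper for illustration; not part of pdftotree.
    """

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_block" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.boxes.append(tuple(int(v) for v in match.groups()))


def block_boxes(hocr_text):
    """Return (x0, y0, x1, y1) for each ocrx_block, in output order."""
    collector = BlockBoxCollector()
    collector.feed(hocr_text)
    return collector.boxes


# The two block tags from the excerpt above, with their contents omitted.
fragment = """
<div class="ocrx_block" pdftotree="section_header" title="bbox 32 68 92 76"></div>
<div class="ocrx_block" pdftotree="paragraph" title="bbox 301 67 552 148"></div>
"""

for x0, y0, x1, y1 in block_boxes(fragment):
    print(x0, y0, x1, y1)
```

In this fragment the section header spans x 32 to 92, and the next block in the output starts at x 301 with almost the same top coordinate (67 versus 68), which is consistent with the output jumping from the left column to the right column.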

HiromuHota self-assigned this Oct 20, 2020

HiromuHota added the bug label Oct 20, 2020

HiromuHota mentioned this issue Oct 20, 2020

List a missing "ocrx_line" in the ocr-capabilities metadata field #95

Merged

4 tasks

HiromuHota closed this as completed in #95 Oct 20, 2020
