hOCRpy

This package extracts text, bounding box, and confidence score information from the structured output of OCR systems like Tesseract. This output, which is called hOCR, is a useful data representation for identifying page formats, messy OCR, and more. In addition to providing a wrapper around hOCR data, hOCRpy enables page rendering (for corpus exploration) and several ways of analyzing the data, including:

Bounding box metrics
Format prediction

Basic Usage

hOCRpy will automatically parse a hOCR file from a filepath.

from hOCRpy import hOCR

path = 'examples/hocr/one_column.hocr'
hocr = hOCR(path)

# Get tokens, their bounding boxes, and their confidence scores
for token, bbox, score in zip(hocr.tokens, hocr.bboxes, hocr.scores):
    print(token, bbox, score)
>> The [193, 157, 245, 180] 0.96
>> Life [256, 157, 304, 180] 0.96
>> and [315, 158, 360, 181] 0.96
>> Work [371, 158, 445, 181] 0.96
>> of [456, 158, 483, 181] 0.96
>> [...]

# Return a plaintext blob
hocr.text
>> 'The Life and Work of...'

# Number of tokens
hocr.num_tokens
>> 324

# Average confidence score
import numpy as np

np.mean(hocr.scores)
>> 0.9364197530864197

During corpus exploration, it's often helpful to get a high-level overview of a page's structure.

hocr.show_structure(which='token')

Option options include area, paragraph, and line. In addition to these, it's possible to re-render the entire page, fitting each token back into its respective bounding box.

hocr.show_page(outline=None, scale=True)

See analysis.ipynb for a demonstration of how hOCRpy may be used to analyze hOCR data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

hOCRpy

Basic Usage

Files

README.md

Latest commit

History

README.md

File metadata and controls

hOCRpy

Basic Usage