hOCRpy

This package extracts text, bounding box, and confidence score information from the structured output of OCR systems like Tesseract. This output, which is called hOCR, is a useful data representation for identifying page formats, messy OCR, and more. In addition to providing a wrapper around hOCR data, hOCRpy enables page rendering (for corpus exploration) and several ways of analyzing the data, including:

  1. Bounding box metrics
  2. Format prediction
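As a rough illustration of what bounding box metrics and a messy-OCR check can look like, here is a minimal sketch that works on plain Python lists shaped like the token, bounding box, and score data the package exposes. It assumes each bounding box is `[x0, y0, x1, y1]` in pixels. The helper names `box_size` and `low_confidence`, the 0.9 threshold, and the sample scores are all hypothetical and are not part of hOCRpy.

```python
# Illustrative stand-ins for token, bounding box, and score lists.
# The scores are made up so that the filter has something to flag.
tokens = ["The", "Life", "and", "Work", "of"]
bboxes = [
    [193, 157, 245, 180],
    [256, 157, 304, 180],
    [315, 158, 360, 181],
    [371, 158, 445, 181],
    [456, 158, 483, 181],
]
scores = [0.96, 0.96, 0.41, 0.96, 0.88]


def box_size(bbox):
    """Return (width, height) of a box given as [x0, y0, x1, y1] (assumed layout)."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


def low_confidence(tokens, scores, threshold=0.9):
    """Return (token, score) pairs whose score falls below the threshold."""
    return [(t, s) for t, s in zip(tokens, scores) if s < threshold]


widths = [box_size(b)[0] for b in bboxes]
heights = [box_size(b)[1] for b in bboxes]
flagged = low_confidence(tokens, scores)

print(widths)
print(heights)
print(flagged)
```

Uniform heights across a line, as in this sample, usually suggest a single font size, while tokens flagged by the confidence filter are candidates for manual review.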

Basic Usage

hOCRpy automatically parses an hOCR file when given its file path.

import numpy as np

from hOCRpy import hOCR

path = 'examples/hocr/one_column.hocr'
hocr = hOCR(path)

# Get tokens, their bounding boxes, and their confidence scores
for token, bbox, score in zip(hocr.tokens, hocr.bboxes, hocr.scores):
    print(token, bbox, score)
# >> The [193, 157, 245, 180] 0.96
# >> Life [256, 157, 304, 180] 0.96
# >> and [315, 158, 360, 181] 0.96
# >> Work [371, 158, 445, 181] 0.96
# >> of [456, 158, 483, 181] 0.96
# >> [...]

# Return a plaintext blob
hocr.text
# >> 'The Life and Work of...'

# Number of tokens
hocr.num_tokens
# >> 324

# Average confidence score
np.mean(hocr.scores)
# >> 0.9364197530864197
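For readers new to the format, the sketch below shows where such tokens, bounding boxes, and scores come from in raw hOCR markup. It uses only the standard library and is not hOCRpy's implementation. The `WordParser` class and the sample markup are hypothetical. The `ocrx_word` class and the `bbox` and `x_wconf` properties in the `title` attribute follow the hOCR convention used by Tesseract. Dividing `x_wconf` by 100 is an assumption made to match the 0 to 1 scores shown above.

```python
from html.parser import HTMLParser

# A made-up hOCR line in the style Tesseract emits.
SAMPLE = (
    "<span class='ocr_line' title='bbox 193 157 483 181'>"
    "<span class='ocrx_word' title='bbox 193 157 245 180; x_wconf 96'>The</span> "
    "<span class='ocrx_word' title='bbox 256 157 304 180; x_wconf 96'>Life</span>"
    "</span>"
)


class WordParser(HTMLParser):
    """Collect (token, bbox, confidence) triples from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending = None  # (bbox, conf) of the word span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "span" or "ocrx_word" not in classes:
            return
        # The title attribute holds ';'-separated properties,
        # e.g. "bbox x0 y0 x1 y1; x_wconf 96".
        props = {}
        for part in (attrs.get("title") or "").split(";"):
            fields = part.split()
            if fields:
                props[fields[0]] = fields[1:]
        bbox = [int(v) for v in props.get("bbox", [])]
        # Assumption: rescale the 0-100 word confidence to the 0-1 range.
        conf = float(props["x_wconf"][0]) / 100 if "x_wconf" in props else None
        self._pending = (bbox, conf)

    def handle_data(self, data):
        # Attach the first non-blank text inside a word span to its properties.
        if self._pending is not None and data.strip():
            bbox, conf = self._pending
            self.words.append((data.strip(), bbox, conf))
            self._pending = None

    def handle_endtag(self, tag):
        # Drop properties of a word span that closed without any text.
        if tag == "span":
            self._pending = None


parser = WordParser()
parser.feed(SAMPLE)
for token, bbox, score in parser.words:
    print(token, bbox, score)
```

A wrapper such as hOCRpy saves you from writing this kind of parsing by hand and also tracks the larger units (areas, paragraphs, lines) that this sketch ignores.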

During corpus exploration, it's often helpful to get a high-level overview of a page's structure.

hocr.show_structure(which='token')

Other options include area, paragraph, and line. In addition to these, it's possible to re-render the entire page, fitting each token back into its respective bounding box.

hocr.show_page(outline=None, scale=True)

See analysis.ipynb for a demonstration of how hOCRpy may be used to analyze hOCR data.