textract2page

Convert AWS Textract JSON to PRImA PAGE XML

Introduction

This software converts OCR results from Amazon AWS Textract Response files to PRImA PAGE XML files.

Installation

pip install textract2page

Usage

The package provides a file-based conversion function, available both as a CLI and as a Python API. The function takes the Textract JSON file and the original image file that was used as input for the OCR. (The image is necessary because Textract stores coordinates as floating-point ratios of the image dimensions, whereas PAGE uses integer pixel coordinates.)
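To illustrate why the image size is needed, here is a minimal sketch of the ratio-to-pixel scaling. It assumes a Textract Polygon given as a list of dicts with X and Y keys, and it produces the "x,y x,y ..." string format used by the points attribute of PAGE Coords. The helper name polygon_to_page_points is hypothetical, and the rounding is an assumption; it is not necessarily how textract2page implements the conversion internally.

```python
def polygon_to_page_points(polygon, image_width, image_height):
    """Scale ratio coordinates (0..1) to integer pixels and join them PAGE-style.

    Hypothetical helper for illustration only; plain rounding is an assumption.
    """
    points = []
    for vertex in polygon:
        x = round(vertex["X"] * image_width)   # ratio of image width -> pixel column
        y = round(vertex["Y"] * image_height)  # ratio of image height -> pixel row
        points.append(f"{x},{y}")
    return " ".join(points)


# Example: a box covering the central part of a 2000x1000 pixel image.
example_polygon = [
    {"X": 0.25, "Y": 0.5},
    {"X": 0.75, "Y": 0.5},
    {"X": 0.75, "Y": 0.75},
    {"X": 0.25, "Y": 0.75},
]
example_points = polygon_to_page_points(example_polygon, 2000, 1000)
```

Without the image dimensions (2000 and 1000 here), the ratios alone could not be mapped to pixel positions.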

Python API

To convert a Textract file example.json for an image file example.jpg to a PAGE example.xml:

from textract2page import convert_file

convert_file("example.json", "example.jpg", "example.xml")
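For several pages at once, the same call can be wrapped in a loop. The following is only a sketch: it assumes that each JSON file sits next to an image with the same base name, and the helpers plan_conversions and convert_directory are hypothetical names, not part of the package. Only convert_file, with the three arguments shown above, comes from textract2page.

```python
from pathlib import Path


def plan_conversions(directory, image_suffix=".jpg"):
    """Return (json, image, xml) path triples for JSON files that have a matching image.

    Hypothetical helper; assumes image and JSON share the same base name.
    """
    triples = []
    for json_path in sorted(Path(directory).glob("*.json")):
        image_path = json_path.with_suffix(image_suffix)
        if image_path.exists():  # skip JSON files without a matching image
            xml_path = json_path.with_suffix(".xml")
            triples.append((str(json_path), str(image_path), str(xml_path)))
    return triples


def convert_directory(directory, image_suffix=".jpg"):
    """Convert every matching JSON/image pair in a directory to PAGE XML."""
    # Imported here so that plan_conversions works without the package installed.
    from textract2page import convert_file

    for json_file, image_file, xml_file in plan_conversions(directory, image_suffix):
        convert_file(json_file, image_file, xml_file)
```

Calling convert_directory("scans") would then write one .xml file next to each .json file that has a matching .jpg image.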

CLI

Analogously, on the command line, writing the result either to standard output or, with -O, to a named file:

textract2page example.json example.jpg > example.xml
textract2page -O example.xml example.json example.jpg

You can get a list of options with --help or -h.

Testing

Testing requires an installed package and a local copy of the repository.

To run regression tests with pytest, do

make deps-test
make test-api

To run the regression tests via the command line, do

# optionally:
sudo apt-get install xmlstarlet
make test-cli

(If xmlstarlet is available, the CLI test also validates the result tree. Otherwise, it only checks that the command completes without error.)
