Convert AWS Textract JSON to PRImA PAGE XML
This software converts OCR results from Amazon AWS Textract Response files to PRImA PAGE XML files.
In a Python virtualenv:
pip install textract2page
The package contains a file-based conversion function provided as CLI and Python API.
The function takes the Textract JSON file and the original image file that was used
as input for the OCR. (The image is necessary because Textract stores coordinates as
float ratios of the image size, whereas PAGE uses int pixel indices.)
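The coordinate mismatch described above can be illustrated with a minimal sketch. It is not the package's actual implementation: the function name bounding_box_to_page_points is hypothetical, and rounding to the nearest pixel is an assumption. The only parts taken from the formats themselves are the Left, Top, Width and Height ratio keys of a Textract BoundingBox and the "x1,y1 x2,y2 ..." form of a PAGE points attribute.

```python
def bounding_box_to_page_points(box, image_width, image_height):
    """Turn a Textract BoundingBox (ratios in 0..1) into a PAGE points string.

    Hypothetical helper for illustration only. `box` is a dict with the
    Textract keys Left, Top, Width and Height; the image size is in pixels.
    """
    # Scale the ratios by the image size and round to integer pixel values
    # (nearest-pixel rounding is an assumption of this sketch).
    left = int(round(box["Left"] * image_width))
    top = int(round(box["Top"] * image_height))
    right = int(round((box["Left"] + box["Width"]) * image_width))
    bottom = int(round((box["Top"] + box["Height"]) * image_height))

    # List the corners clockwise, starting at the top-left corner.
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return " ".join(f"{x},{y}" for x, y in corners)


if __name__ == "__main__":
    box = {"Left": 0.25, "Top": 0.125, "Width": 0.5, "Height": 0.25}
    print(bounding_box_to_page_points(box, 1000, 800))
```

Without the image size, the ratios could not be turned into pixel values at all, which is why the image file is a required argument.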
To convert a Textract file example.json for an image file example.jpg to a PAGE file example.xml:
from textract2page import convert_file
convert_file("example.json", "example.jpg", "example.xml")
Analogously, on the command line interface:
textract2page example.json example.jpg > example.xml
textract2page -O example.xml example.json example.jpg
You can get a list of options with --help or -h.
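For more than one file, the Python API can be wrapped in a small loop. The following is a sketch under stated assumptions: the helper names plan_conversions and convert_directory are hypothetical, and it assumes each Textract JSON file sits next to an image with the same base name. The converter is passed in as an argument, so the block itself runs without the package installed.

```python
from pathlib import Path


def plan_conversions(directory, image_suffix=".jpg"):
    """Return (json, image, xml) path triples for a directory.

    Hypothetical helper: only JSON files with a same-named image next to
    them are included; the XML path is the JSON path with an .xml suffix.
    """
    jobs = []
    for json_path in sorted(Path(directory).glob("*.json")):
        image_path = json_path.with_suffix(image_suffix)
        if image_path.exists():
            jobs.append((json_path, image_path, json_path.with_suffix(".xml")))
    return jobs


def convert_directory(directory, convert, image_suffix=".jpg"):
    """Call convert(json, image, xml) for each planned job; return the count."""
    jobs = plan_conversions(directory, image_suffix)
    for json_path, image_path, xml_path in jobs:
        # Paths are passed as strings, matching the convert_file example.
        convert(str(json_path), str(image_path), str(xml_path))
    return len(jobs)
```

With the package installed, convert_directory("scans", convert_file) would then convert every matching pair in a (hypothetical) scans directory, using convert_file as imported in the example above.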
Testing requires installation and a local copy of the repository.
To run the regression tests with pytest, do
make deps-test
make test-api
To run the regression tests via the command line, do
# optionally:
sudo apt-get install xmlstarlet
make test-cli
(If xmlstarlet is available, the CLI test will also validate the result tree.
Otherwise, it only checks that the command completes without error.)