PagePlus is a Python-based command-line tool for processing and analyzing PAGE XML files, which are commonly used in document layout analysis. This tool provides a variety of functions to modify and extract data from these files, providing an efficient way to handle text and region-based information in document images.
PagePlus includes several commands to perform operations such as
Analytics: Gathers detailed statistics about the contents of PAGE XML files, including counts of text regions, table regions, lines of text, words, and glyphs. This feature is essential for understanding the scope and size of processed documents.
Validation: Ensures the integrity of text regions and lines in PAGE XML files, checking for and reporting any inconsistencies or errors.
Repair: Includes a set of repair functions that can fix common problems in PAGE XML files. This functionality is essential for improving the usability of processed files.
Extend regions, lines and baselines: Buffers regions, lines and baselines.
Translating regions, lines and baselines: Translates regions, lines and baslines by an offset.
Textline sorting and merging/splitting: Sorts and merges text rows in PAGE XML files based on specified horizontal and vertical gap thresholds.
Baseline and Textline modification: Processes text lines to extend baselines, fit textline polygons into parent regions, and calculate pseudo textline polygons.
Region and word-level text deletion: Provides the ability to delete text at various hierarchical levels within PAGE XML files.
Fulltext extraction: Extracts all text from PAGE XML files, optionally delimiting lines, and saves the output as plain text files.
DSV (Delimiter-Separated Value) extraction: Creates delimiter-separated files (such as CSV or TSV) from PAGE XML files, including details such as line IDs, text content, region numbers, baseline coordinates, and dimensions.
To install PagePlus, you will need Python and poetry installed on your system. Clone the repository or download the source code and run the following command in the root directory:
poetry install
PagePlus can be executed from the command line. Here are some examples of how to use its features:
poetry shell
pageplus analytics statistics /path/to/xml/files
pageplus validation validate-all /path/to/xml/files
pageplus modification repair /path/to/xml/files
pageplus modification extend-lines /path/to/xml/files
pageplus modification pseudolinepolygon /path/to/xml/files
pageplus modification delete_text /path/to/xml/files
pageplus export fulltext /path/to/xml/files --outputdir /path/to/output
pageplus export dsv /path/to/xml/files --delimiter ',' --outputdir /path/to/output
Most of the commands in PagePlus offer configurable options such as Outputdir for specifying the output directory, and other parameters for customizing the file processing. Use the --help flag with any command to see all available options:
pageplus --help
Contributions to PagePlus are welcome! If you find a bug or have a feature request, please open an issue.
PagePlus was created during the 3rd funding phase of the OCR-D project and predominantly funded by the German Research Foundation (DFG).