Workflow Guide format conversion
Robert Sachunsky edited this page Feb 8, 2022 · 7 revisions
In this processing step, the PAGE XML files produced so far can be converted to ALTO, PDF, hOCR or plain text files. ALTO and hOCR can in turn be converted into other formats, whereas the PDF version of the PAGE XML OCR results is a widely accessible format that experts and laypersons alike can use as-is.
Processor | Parameter | Remarks | Call |
---|---|---|---|
`ocrd-fileformat-transform` | | As the value consists of two words, when using the `-P` form it has to be enclosed in quotation marks. If you want to save all OCR results in one file, you can use the following command: `cat OCR* > full.txt` | `ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO` |
`mets-mods2tei` | `--ocr -T FULLTEXT -I OCR-D-IMG` | Not a processor CLI, processes the workspace METS, generating a single TEI (DTABf-formatted) for the whole document. Only takes ALTO input, so usually needs a prior `ocrd-fileformat-transform -I OCR-D-OCR -O FULLTEXT -P from-to "page alto"`. | `mm2tei --ocr -T FULLTEXT -I OCR-D-IMG -O TEI.xml` |
`ocrd-page2tei` | | Generates a single TEI (using page2tei XSLT with Saxon) for the whole document. | `ocrd-page2tei -I OCR-D-OCR -O OCR-D-TEI` |
`ocrd-pagetopdf` | | | `ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word` |
`ocrd-segment-extract-pages` | `-P mimetype image/png -P transparency true` | Get page images (cropped and deskewed as annotated; raw and binarized) and mask images (color-coded for regions) along with JSON files for region annotations (custom and COCO format). | `ocrd-segment-extract-pages -I OCR-D-SEG-REGION -O OCR-D-IMG-PAGE,OCR-D-IMG-PAGE-BIN,OCR-D-IMG-PAGE-MASK` |
`ocrd-segment-extract-regions` | `-P mimetype image/png -P transparency true` | Get region images (cropped, masked and deskewed as annotated) along with JSON files for region annotations (custom format). | `ocrd-segment-extract-regions -I OCR-D-SEG-REGION -O OCR-D-IMG-REGION` |
`ocrd-segment-extract-lines` | `-P mimetype image/png -P transparency true` | Get text line images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for line annotations (custom format). | `ocrd-segment-extract-lines -I OCR-D-SEG-LINE -O OCR-D-IMG-LINE` |
`ocrd-segment-extract-words` | `-P mimetype image/png -P transparency true` | Get word images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for word annotations (custom format). | `ocrd-segment-extract-words -I OCR-D-SEG-WORD -O OCR-D-IMG-WORD` |
`ocrd-segment-extract-glyphs` | `-P mimetype image/png -P transparency true` | Get glyph images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for glyph annotations (custom format). | `ocrd-segment-extract-glyphs -I OCR-D-SEG-GLYPH -O OCR-D-IMG-GLYPH` |
`ocrd-segment-from-masks` | | Import mask images as region segmentation. If `colordict` is empty, defaults to PageViewer color scheme (also written by `ocrd-segment-extract-pages`). | `ocrd-segment-from-masks -I OCR-D-SEG-PAGE,OCR-D-IMG-PAGE-MASK -O OCR-D-SEG-REGION` |
`ocrd-segment-from-coco` | | Import COCO format region segmentation (also written by `ocrd-segment-extract-pages`). | `ocrd-segment-from-coco -I OCR-D-SEG-PAGE,OCR-D-SEG-COCO -O OCR-D-SEG-REGION` |
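
The two shell details mentioned in the remarks for ocrd-fileformat-transform (quoting a two-word parameter value, and concatenating text results with `cat OCR* > full.txt`) can be tried out without an OCR-D installation. The following is only a minimal sketch: the helper `count_args` and the file names `OCR_0001.txt` and `OCR_0002.txt` are made up for illustration and are not part of OCR-D; real text output files may be named differently.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): prints how many arguments it received.
count_args() { echo "$#"; }

# With quotation marks, the two-word value "page alto" stays a single argument.
quoted=$(count_args -P from-to "page alto")
# Without them, the shell splits the value into two separate arguments.
unquoted=$(count_args -P from-to page alto)
echo "quoted: $quoted arguments, unquoted: $unquoted arguments"

# Simulate per-page text results in a scratch directory (made-up file names).
workdir=$(mktemp -d)
printf 'first page\n'  > "$workdir/OCR_0001.txt"
printf 'second page\n' > "$workdir/OCR_0002.txt"

# The glob expands in sorted order, so pages are concatenated in sequence.
cat "$workdir"/OCR* > "$workdir/full.txt"
cat "$workdir/full.txt"
```

Because the glob is sorted lexically, this approach presumably relies on zero-padded page numbers in the file names; otherwise page 10 would sort before page 2.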