Workflow Guide format conversion

In this processing step, the PAGE XML files produced so far can be converted to ALTO, PDF, hOCR or text files. ALTO and hOCR can in turn be converted into other formats, whereas the PDF version of the PAGE XML OCR results is a widely accessible format that experts and laypersons alike can use as-is.
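
As a rough sketch of how these conversions might be chained, the script below prints one command per target format, reusing the calls from the processor table on this page. The output file group names `OCR-D-HOCR` and `OCR-D-TEXT`, the `run` and `convert_all` helpers, and the `DRY_RUN` switch are illustrative assumptions, not part of OCR-D. By default nothing is executed; the commands are only printed.

```sh
#!/bin/sh
# Print each command instead of executing it unless DRY_RUN=0 is set.
# Arguments containing a space are shown in double quotes.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        for arg in "$@"; do
            case $arg in
                *" "*) printf '"%s" ' "$arg" ;;
                *)     printf '%s ' "$arg" ;;
            esac
        done
        printf '\n'
    else
        "$@"
    fi
}

# Convert the PAGE XML results in OCR-D-OCR into each target format.
# The output file group names are illustrative, not prescribed.
convert_all() {
    for target in alto hocr text; do
        group="OCR-D-$(printf '%s' "$target" | tr '[:lower:]' '[:upper:]')"
        run ocrd-fileformat-transform -I OCR-D-OCR -O "$group" -P from-to "page $target"
    done
    run ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word
}

convert_all
```

To actually run the conversions, the script would have to be started with `DRY_RUN=0` from inside a workspace directory in which the processors are installed.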

Available processors

Processor	Parameter	Remarks	Call
ocrd-fileformat-transform	`-P from-to "alto2.0 alto3.0"`; other accepted values: `"alto2.0 alto3.1"`, `"alto2.0 hocr"`, `"alto2.1 alto3.0"`, `"alto2.1 alto3.1"`, `"alto2.1 hocr"`, `"alto page"`, `"alto text"`, `"gcv hocr"`, `"hocr alto2.0"`, `"hocr alto2.1"`, `"hocr text"`, `"page alto"`, `"page hocr"`, `"page text"`	As the value consists of two words, it has to be enclosed in quotation marks when using the `-P` form. To save all OCR results in one file, you can concatenate them with `cat OCR* > full.txt`.	`ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO`
mets-mods2tei	`--ocr -T FULLTEXT -I OCR-D-IMG`	Not a processor CLI; it processes the workspace METS and generates a single TEI file (DTABf-formatted) for the whole document. It only takes ALTO input, so it usually needs a prior `ocrd-fileformat-transform -I OCR-D-OCR -O FULLTEXT -P from-to "page alto"`.	`mm2tei --ocr -T FULLTEXT -I OCR-D-IMG -O TEI.xml`
ocrd-page2tei		Generates a single TEI file (using the page2tei XSLT with Saxon) for the whole document.	`ocrd-page2tei -I OCR-D-OCR -O OCR-D-TEI`
ocrd-pagetopdf	`{ "font": "AletheiaSans.ttf", "negative2zero": true, "multipage": "name_of_pdf", "pagelabel": "pageId", "textequiv_level": "word", "outlines": "line" }`	`font`: font file name to use for rendering text. `negative2zero`: fix (invalid) negative coordinates. `multipage`: concatenate to multi-page PDF (empty for none). `pagelabel`: multi-page PDF page labels. `textequiv_level`: render text on this hierarchy level. `outlines`: draw polygon outlines in the PDF (empty for none).	`ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word`
ocrd-segment-extract-pages	`-P mimetype image/png -P transparency true`	Get page images (cropped and deskewed as annotated; raw and binarized) and mask images (color-coded for regions) along with JSON files for region annotations (custom and COCO format).	`ocrd-segment-extract-pages -I OCR-D-SEG-REGION -O OCR-D-IMG-PAGE,OCR-D-IMG-PAGE-BIN,OCR-D-IMG-PAGE-MASK`
ocrd-segment-extract-regions	`-P mimetype image/png -P transparency true`	Get region images (cropped, masked and deskewed as annotated) along with JSON files for region annotations (custom format).	`ocrd-segment-extract-regions -I OCR-D-SEG-REGION -O OCR-D-IMG-REGION`
ocrd-segment-extract-lines	`-P mimetype image/png -P transparency true`	Get text line images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for line annotations (custom format).	`ocrd-segment-extract-lines -I OCR-D-SEG-LINE -O OCR-D-IMG-LINE`
ocrd-segment-extract-words	`-P mimetype image/png -P transparency true`	Get word images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for word annotations (custom format).	`ocrd-segment-extract-words -I OCR-D-SEG-WORD -O OCR-D-IMG-WORD`
ocrd-segment-extract-glyphs	`-P mimetype image/png -P transparency true`	Get glyph images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for glyph annotations (custom format).	`ocrd-segment-extract-glyphs -I OCR-D-SEG-GLYPH -O OCR-D-IMG-GLYPH`
ocrd-segment-from-masks	`-P colordict '{ "#969696": "TableRegion", "#00FF00": "TextRegion:page-number", "#FFFF00": "TextRegion:heading", "#00FFFF": "GraphicRegion:logo", "#0000FF": "TextRegion:subject", "#FF0000": "TextRegion:catch-word", "#FF00FF": "TextRegion:footnote", "#646464": "TextRegion:paragraph" }'`	Import mask images as region segmentation. If `colordict` is empty, defaults to PageViewer color scheme (also written by `ocrd-segment-extract-pages`).	`ocrd-segment-from-masks -I OCR-D-SEG-PAGE,OCR-D-IMG-PAGE-MASK -O OCR-D-SEG-REGION`
ocrd-segment-from-coco		Import COCO format region segmentation (also written by `ocrd-segment-extract-pages`).	`ocrd-segment-from-coco -I OCR-D-SEG-PAGE,OCR-D-SEG-COCO -O OCR-D-SEG-REGION`
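
The `cat OCR* > full.txt` remark for `ocrd-fileformat-transform` relies on the shell expanding the glob in sorted order. The following minimal sketch illustrates this with two stand-in files. The directory name `OCR-D-TEXT` and the file names are hypothetical; real output names depend on the workspace and the chosen output file group.

```sh
#!/bin/sh
# Demo data standing in for per-page text output (names are hypothetical).
workdir=$(mktemp -d)
mkdir "$workdir/OCR-D-TEXT"
printf 'first page\n'  > "$workdir/OCR-D-TEXT/OCR-D-TEXT_0001.txt"
printf 'second page\n' > "$workdir/OCR-D-TEXT/OCR-D-TEXT_0002.txt"

# Globs expand in sorted order, so zero-padded page numbers
# keep the pages in sequence in the combined file.
cat "$workdir"/OCR-D-TEXT/*.txt > "$workdir/full.txt"
```

If page numbers in the file names are not zero-padded, the sorted order will not match the page order (for example, page 10 would sort before page 2), so the file list would have to be ordered explicitly.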

Notes on parameter usage

E.g.

which parameters do you use with what values?
which parameters are insufficiently documented?
which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.
