
Workflow Guide format conversion

Robert Sachunsky edited this page Feb 8, 2022 · 7 revisions

In this processing step, the produced PAGE XML files can be converted to ALTO, PDF, hOCR or text files. ALTO and hOCR can in turn be converted into other formats, whereas the PDF rendering of PAGE XML OCR results is a widely accessible format that can be used as-is by experts and laypersons alike.

Available processors

Processor Parameter Remarks Call
Processor: ocrd-fileformat-transform
Parameter:
-P from-to "alto2.0 alto3.0"
      # or "alto2.0 alto3.1"
      # or "alto2.0 hocr"
      # or "alto2.1 alto3.0"
      # or "alto2.1 alto3.1"
      # or "alto2.1 hocr"
      # or "alto page"
      # or "alto text"
      # or "gcv hocr"
      # or "hocr alto2.0"
      # or "hocr alto2.1"
      # or "hocr text"
      # or "page alto"
      # or "page hocr"
      # or "page text"
Remarks: As the value consists of two words, it has to be enclosed in quotation marks when using the -P form. If you want to save all OCR results in one file, you can use the following command:
cat OCR* > full.txt
Call: ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO
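The quoting remark can be checked without OCR-D installed. The sketch below uses a hypothetical helper, count_args, to show that the quoted from-to value reaches a program as one argument, and then simulates joining per-page text files into one. The directory and file names (OCR-D-TXT, OCR-D-TXT_0001.txt) are assumptions made for illustration, not names the processor is guaranteed to produce.

```shell
# Hypothetical helper: report how many arguments a command would receive.
count_args() { echo "$#"; }

# Quoted: the two-word value stays a single argument (-P, from-to, "page alto").
quoted=$(count_args -P from-to "page alto")
# Unquoted: the shell splits the value, so the program sees one argument more.
unquoted=$(count_args -P from-to page alto)
echo "quoted: $quoted arguments, unquoted: $unquoted arguments"

# Simulated text output: two per-page files in an assumed OCR-D-TXT directory.
workdir=$(mktemp -d)
mkdir "$workdir/OCR-D-TXT"
printf 'first page\n' > "$workdir/OCR-D-TXT/OCR-D-TXT_0001.txt"
printf 'second page\n' > "$workdir/OCR-D-TXT/OCR-D-TXT_0002.txt"

# The glob expands in sorted order, so zero-padded page numbers keep page order.
cat "$workdir"/OCR-D-TXT/*.txt > "$workdir/full.txt"
cat "$workdir/full.txt"
```

In a real workspace, the same cat call would be pointed at the directory that holds the text output of the conversion.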
Processor: mets-mods2tei
Parameter: --ocr -T FULLTEXT -I OCR-D-IMG
Remarks: Not a processor CLI; it processes the workspace METS and generates a single TEI file (DTABf-formatted) for the whole document. It only takes ALTO input, so it usually needs a prior ocrd-fileformat-transform -I OCR-D-OCR -O FULLTEXT -P from-to "page alto".
Call: mm2tei --ocr -T FULLTEXT -I OCR-D-IMG -O TEI.xml
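Since mm2tei reads ALTO, the conversion to ALTO has to run before it. A minimal sketch of that sequence follows. The step helper and the DRY_RUN variable are hypothetical conveniences (not part of OCR-D); with the default DRY_RUN=1 the commands are only printed, so the script can be tried without a workspace.

```shell
# Hypothetical wrapper: print a command, and run it only when DRY_RUN is not 1.
DRY_RUN=${DRY_RUN:-1}
step() {
    printf '%s\n' "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# 1. Convert the PAGE XML results to ALTO in the FULLTEXT file group.
# 2. Build one TEI file for the whole document from the ALTO files.
# The && makes sure step 2 is skipped when step 1 fails.
step ocrd-fileformat-transform -I OCR-D-OCR -O FULLTEXT -P from-to "page alto" &&
step mm2tei --ocr -T FULLTEXT -I OCR-D-IMG -O TEI.xml
```

The printed trace drops the quotation marks around "page alto", but the actual call still passes it as one value. Setting DRY_RUN=0 inside a workspace would execute both steps.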
Processor: ocrd-page2tei
Remarks: Generates a single TEI file (using the page2tei XSLT with Saxon) for the whole document.
Call: ocrd-page2tei -I OCR-D-OCR -O OCR-D-TEI
Processor: ocrd-pagetopdf
Parameter:
{
  # font file name to use for rendering text
  "font": "AletheiaSans.ttf",
  # fix (invalid) negative coordinates
  "negative2zero": true,
  # concatenate to multi-page PDF (empty for none)
  "multipage": "name_of_pdf",
  # multi-page PDF page labels
  "pagelabel": "pageId",
  # render text on this hierarchy level
  "textequiv_level": "word",
  # draw polygon outlines in the PDF (empty for none)
  "outlines": "line"
}
Call: ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word
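Instead of passing each setting with -P, the parameters listed above can be kept in a JSON file and handed over with -p, which OCR-D processors accept for a parameter file. The sketch below writes such a file with the values from the table. The file name pagetopdf-params.json and the temporary directory are assumptions, and the processor call only runs if ocrd-pagetopdf is installed and a mets.xml is present in the current directory.

```shell
# Write the parameter set to a JSON file (the file name is an arbitrary choice).
params_dir=$(mktemp -d)
params="$params_dir/pagetopdf-params.json"
cat > "$params" <<'EOF'
{
  "font": "AletheiaSans.ttf",
  "negative2zero": true,
  "multipage": "name_of_pdf",
  "pagelabel": "pageId",
  "textequiv_level": "word",
  "outlines": "line"
}
EOF
echo "wrote $params"

# Run the processor only where it is available and a workspace exists.
if command -v ocrd-pagetopdf >/dev/null 2>&1 && [ -f mets.xml ]; then
    ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -p "$params"
else
    echo "ocrd-pagetopdf or mets.xml not found; skipping the call"
fi
```

Keeping the settings in a file makes it easier to reuse the same PDF rendering options across several workspaces.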
Processor: ocrd-segment-extract-pages
Parameter: -P mimetype image/png -P transparency true
Remarks: Get page images (cropped and deskewed as annotated; raw and binarized) and mask images (color-coded for regions) along with JSON files for region annotations (custom and COCO format).
Call: ocrd-segment-extract-pages -I OCR-D-SEG-REGION -O OCR-D-IMG-PAGE,OCR-D-IMG-PAGE-BIN,OCR-D-IMG-PAGE-MASK
Processor: ocrd-segment-extract-regions
Parameter: -P mimetype image/png -P transparency true
Remarks: Get region images (cropped, masked and deskewed as annotated) along with JSON files for region annotations (custom format).
Call: ocrd-segment-extract-regions -I OCR-D-SEG-REGION -O OCR-D-IMG-REGION
Processor: ocrd-segment-extract-lines
Parameter: -P mimetype image/png -P transparency true
Remarks: Get text line images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for line annotations (custom format).
Call: ocrd-segment-extract-lines -I OCR-D-SEG-LINE -O OCR-D-IMG-LINE
Processor: ocrd-segment-extract-words
Parameter: -P mimetype image/png -P transparency true
Remarks: Get word images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for word annotations (custom format).
Call: ocrd-segment-extract-words -I OCR-D-SEG-WORD -O OCR-D-IMG-WORD
Processor: ocrd-segment-extract-glyphs
Parameter: -P mimetype image/png -P transparency true
Remarks: Get glyph images (cropped, masked and deskewed as annotated) along with text files (Ocropus convention) and JSON files for glyph annotations (custom format).
Call: ocrd-segment-extract-glyphs -I OCR-D-SEG-GLYPH -O OCR-D-IMG-GLYPH
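The line, word and glyph extractors follow one naming pattern for the processor, the input file group and the output file group. As an illustration, the hypothetical helper extract_cmd below prints the corresponding call for a given level, with the parameters from the table appended. It only prints the command text and does not run OCR-D.

```shell
# Hypothetical helper: print the extraction call for one hierarchy level.
extract_cmd() {
    level=$1
    # The file group names in the calls above use the upper-cased level name.
    upper=$(printf '%s' "$level" | tr '[:lower:]' '[:upper:]')
    printf '%s\n' "ocrd-segment-extract-${level}s -I OCR-D-SEG-$upper -O OCR-D-IMG-$upper -P mimetype image/png -P transparency true"
}

# Print the calls for the three levels that share the pattern.
for level in line word glyph; do
    extract_cmd "$level"
done
```

The page and region extractors are left out of the loop because both read from OCR-D-SEG-REGION and the page extractor writes three output file groups.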
Processor: ocrd-segment-from-masks
Parameter:
-P colordict '{
  "#969696": "TableRegion",
  "#00FF00": "TextRegion:page-number",
  "#FFFF00": "TextRegion:heading",
  "#00FFFF": "GraphicRegion:logo",
  "#0000FF": "TextRegion:subject",
  "#FF0000": "TextRegion:catch-word",
  "#FF00FF": "TextRegion:footnote",
  "#646464": "TextRegion:paragraph" }'
Remarks: Import mask images as region segmentation. If colordict is empty, it defaults to the PageViewer color scheme (also written by ocrd-segment-extract-pages).
Call: ocrd-segment-from-masks -I OCR-D-SEG-PAGE,OCR-D-IMG-PAGE-MASK -O OCR-D-SEG-REGION
Processor: ocrd-segment-from-coco
Remarks: Import COCO format region segmentation (also written by ocrd-segment-extract-pages).
Call: ocrd-segment-from-coco -I OCR-D-SEG-PAGE,OCR-D-SEG-COCO -O OCR-D-SEG-REGION

Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? Feel free to post sample images here, too.
