
Workflow Guide post correction

Robert Sachunsky edited this page Feb 3, 2022 · 6 revisions

In this processing step, the recognized text is corrected by statistical error modelling, language modelling, and word modelling (dictionaries, morphology and orthography).

Note: Most tools benefit strongly from input that includes alternative OCR hypotheses. Currently, models for ocrd-cor-asv-ann-process are optimised for input from specific OCR models, whereas ocrd-cis-postcorrect expects input from multi-OCR alignment. For more information, see this presentation at vDHd 2021 (held on 23rd May 2021; slides / video in German).

Note: There is some overlap with text alignment here, which can also be used for (or contribute to) post-correction.

Available processors

Processor: ocrd-cor-asv-ann-process
Parameter: -P textequiv_level word -P model_file modelname
Remarks: Pre-trained models can be found here and here or downloaded via the OCR-D resource manager. If you didn't download the model with resmgr, for model_file you need to pass the local filesystem path as parameter value. (Relative paths are resolved from the workspace directory or the environment variable CORASVANN_DATA.) There is no default model_file.
Call: ocrd-cor-asv-ann-process -I OCR-D-OCR -O OCR-D-PROCESS -P textequiv_level word -P model_file /path/to/model/model.h5

Processor: ocrd-cis-postcorrect
Parameter: -P profilerPath /path/to/profiler.bash -P profilerConfig ignored -P nOCR 2 -P model /path/to/model/model.zip
Remarks: The profilerConfig parameters can be specified in a JSON file. If you do not want to use a profiler, you can set the value for profilerConfig to ignored. In this case, your profiler.bash should look like this:

    #!/bin/bash
    cat > /dev/null
    echo '{}'

For model you need to pass the local filesystem path as parameter value. There is no default model.
Call: ocrd-cis-postcorrect -I OCR-D-ALIGN -O OCR-D-CORRECT -p postcorrect.json
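
The ocrd-cis-postcorrect call above reads its parameters from postcorrect.json, but the page does not show that file. The following is a minimal sketch of how the no-op profiler and a matching parameter file might be set up. It assumes that the JSON keys are the same names as the -P options listed above (the general OCR-D convention for -p files). The temporary working directory is a hypothetical choice for illustration, and the model path is the placeholder from above, not a real file.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical scratch directory for this sketch; adjust to your setup.
workdir="$(mktemp -d)"

# No-op profiler as described above: discard stdin, print an empty JSON object.
cat > "$workdir/profiler.bash" <<'EOF'
#!/bin/bash
cat > /dev/null
echo '{}'
EOF
chmod +x "$workdir/profiler.bash"

# Parameter file mirroring the -P options listed above (assumed key names).
# The model path is a placeholder and must point to your own model.
cat > "$workdir/postcorrect.json" <<EOF
{
  "profilerPath": "$workdir/profiler.bash",
  "profilerConfig": "ignored",
  "nOCR": 2,
  "model": "/path/to/model/model.zip"
}
EOF

# Sanity check: the stub profiler should answer any input with {}
echo 'some input' | "$workdir/profiler.bash"

# The processor would then be run from the workspace directory, e.g.:
# ocrd-cis-postcorrect -I OCR-D-ALIGN -O OCR-D-CORRECT -p "$workdir/postcorrect.json"
```

Keeping the parameters in a file rather than repeating the -P options on each call makes it easier to reuse the same settings across workspaces.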

Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.