Konstantin Baierer edited this page Jul 7, 2020 · 13 revisions

OCR-D - evaluation of workflows and intermediate results

Problem statement

Which data and tools can we use to objectively measure quality and compare the results of both complete workflows and individual processing steps (beyond the final text's Character Error Rate), even on a non-representative sample?

Tasks

Preprocessing

  1. Pixel-by-pixel comparison (e.g. for binarization: what percentage of black pixels in the output are also black in the GT)
  2. Connected-component statistics and specialised measures (e.g. of vertical/horizontal projection profiles)
  3. RGB/grayscale entropy or other general-purpose image measures
  4. Downstream segmentation quality/accuracy
  5. Downstream recognition quality (gotcha: this conflates preprocessing errors with errors of the later steps)
  6. ...
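The pixel-by-pixel comparison in item 1 can be sketched as foreground precision/recall between a binarized output and binarized GT. This is only an illustration, not one of the tools listed below: images are modelled as flat lists of 0/1 values (1 = black/foreground); a real implementation would operate on arrays loaded from image files.

```python
def binarization_scores(pred, gt):
    """Precision, recall and F1 of foreground (black) pixels
    in the prediction with respect to the ground truth."""
    assert len(pred) == len(gt), "images must have the same size"
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy 3x3 "images": the prediction misses one GT foreground pixel.
gt   = [1, 1, 0,  0, 1, 0,  0, 0, 0]
pred = [1, 1, 0,  0, 0, 0,  0, 0, 0]
p, r, f = binarization_scores(pred, gt)
print(p, r)  # precision 1.0, recall 2/3
```

Precision here answers the question from item 1 ("what percentage of black pixels in the output are also black in the GT"); recall is the complementary question of how much GT foreground survives.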

Segmentation

https://github.com/OCR-D/ocrd_segment/wiki/SegmentationEvaluation

Recognition / Post-correction

  1. Edit distance on characters/words after text alignment with the GT
  2. Edit distance or precision/recall after indexing the GT (ignoring reading order and/or textline order and/or reading direction – "bag of words")
  3. (post-correction only:) precision/recall of the corrections
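Items 1 and 2 can be sketched as follows. This is a minimal illustration of the two measures, not the API of dinglehopper or cor-asv-ann: CER as edit distance normalised by GT length, and order-independent "bag of words" precision/recall over word multisets.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr, gt):
    """Character error rate: edit distance normalised by GT length."""
    return levenshtein(ocr, gt) / len(gt)

def bag_of_words_scores(ocr, gt):
    """Precision/recall over word multisets, ignoring order entirely."""
    ocr_bag, gt_bag = Counter(ocr.split()), Counter(gt.split())
    common = sum((ocr_bag & gt_bag).values())  # multiset intersection
    return common / sum(ocr_bag.values()), common / sum(gt_bag.values())

print(cer("Fraktvr", "Fraktur"))  # one substitution in 7 characters
print(bag_of_words_scores("the quick fox", "the quick brown fox"))
```

WER works the same way as CER with token lists instead of character strings; the tools listed below additionally handle alignment of whole pages and Unicode normalization, which this sketch ignores.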

Also to be considered: complementing historical dictionaries.

Tools

PRImA LayoutEval

https://www.primaresearch.org/alternative_download_links.html

https://www.dropbox.com/s/ky53r9k79tb0ywz/LayoutEvaluation_1.9.132.zip?dl=0

Partial source code: https://github.com/PRImA-Research-Lab/prima-layout-eval

Partial documentation: https://github.com/PRImA-Research-Lab/prima-layout-eval/blob/master/doc/liblayouteval.pdf

Dinglehopper

https://github.com/qurator-spk/dinglehopper

… text alignment CER/WER (mean) per page, visual comparison

cor-asv-ann-evaluate

https://github.com/ASVLeipzig/cor-asv-ann

… text alignment CER (mean+variance) with (multi-OCR) aggregation across pages/documents, confusion statistics, various metrics and normalization options (GT levels)

Leptonica / pyleptonica

https://github.com/jsbueno/pyleptonica

[name=Robert Sachunsky] Why does this PR appear in the list of eval tools?

[name=kba] Copy/paste error :) The PR is essential, though, to move forward with a potential pyleptonica-based dewarping wrapper for OCR-D

impactcentre/ocrevalUAtion

https://github.com/impactcentre/ocrevalUAtion

Data

Where can we find challenging, yet well-annotated data to test such evaluations?

structural GT (for preprocessing and segmentation)

How about the OCR-D structure GT (1000 pages, DTA), @tboenig?

textual GT (for recognition)

How about the OCR-D structure+text GT?

Articles

Outlook

  • Which eval tools could be wrapped with an OCR-D CLI with manageable effort?
  • What methodology do we use to evaluate processors, parameters and workflows (GT curation; error/quality aggregation and slicing across metadata; overall CER vs. step-specific measures; …)?
  • How should evaluation fit into OCR-D workflow runtime management? (E.g. should missing a threshold value make the workflow fail? Only the page? Should we reuse the ValidationReport mechanics used elsewhere?)
