
QA Spec #225

Merged
merged 57 commits, Mar 1, 2023
Changes from 6 commits

Commits (57)
e2d4038
wip
kba Sep 6, 2022
0cc4a0b
Merge remote-tracking branch 'origin/master' into qa-spec
kba Sep 14, 2022
5aa6bd5
rewrite eval schema and saple according to OCR-D/zenhub#123
kba Sep 21, 2022
6cd0caf
add metrics to ocrd_eval.md
kba Sep 21, 2022
b529531
ocrd_eval: \begin{array}{ll} instead of .. {2}
kba Sep 22, 2022
18333b8
style(ocrd_eval.md): linting, formatting and correcting images
mweidling Sep 26, 2022
fe9d6ff
stlye: add new line
mweidling Sep 26, 2022
d7854a1
Apply suggestions from code review
kba Sep 27, 2022
ee67881
Apply suggestions from code review
kba Sep 27, 2022
5b35358
retcon JSON changes to YAML
mweidling Sep 27, 2022
1aa048c
comment EvaluationMetrics back in
kba Sep 27, 2022
5840476
generate minimal JSON from YAML src
kba Sep 27, 2022
c9d313f
comment out undiscussed CER metrics
mweidling Sep 27, 2022
a814c89
feat: move workflow_steps to ocr_workflow object
mweidling Nov 24, 2022
a881e08
remove schema from this branch, cf. #236
kba Dec 19, 2022
c7ae88d
integrate Uwe's feedback
mweidling Jan 23, 2023
7ce6c1a
Update ocrd_eval.md
mweidling Jan 23, 2023
ef8aeea
Update ocrd_eval.md
mweidling Jan 23, 2023
e2d2ec9
Update ocrd_eval.md
mweidling Feb 3, 2023
ad97594
Update ocrd_eval.md
mweidling Feb 3, 2023
34a78cb
Update ocrd_eval.md
mweidling Feb 3, 2023
c95ce0b
Update ocrd_eval.md
mweidling Feb 3, 2023
13b2bcd
Update ocrd_eval.md
mweidling Feb 3, 2023
b823afc
Update ocrd_eval.md
mweidling Feb 3, 2023
8ab1391
Apply suggestions from code review
mweidling Feb 3, 2023
5e1da31
Apply suggestions from code review
mweidling Feb 3, 2023
0183ea9
Update ocrd_eval.md
mweidling Feb 3, 2023
5519120
update character definition wrt. white spaces
mweidling Feb 7, 2023
dd2d63b
refine paragraph about characters
mweidling Feb 9, 2023
0253720
move character section before edit distance section
mweidling Feb 9, 2023
19deddd
add placeholder for letter accuracy
mweidling Feb 9, 2023
149b271
fix link
mweidling Feb 9, 2023
7d0bbf6
Update ocrd_eval.md
mweidling Feb 9, 2023
851aeb7
Update ocrd_eval.md
mweidling Feb 9, 2023
c81079a
Update ocrd_eval.md
mweidling Feb 10, 2023
fab6202
Update ocrd_eval.md
mweidling Feb 10, 2023
f678050
implement feedback
mweidling Feb 10, 2023
a680c70
be more precise about CER/WER granularity
mweidling Feb 10, 2023
5e94aa0
change GPU metrics
mweidling Feb 10, 2023
e8dc864
change citation hint
mweidling Feb 10, 2023
ee330c9
adjust WER definition
mweidling Feb 10, 2023
9ea4b62
Apply suggestions from code review
mweidling Feb 10, 2023
c910f0e
add bow metric
mweidling Feb 14, 2023
8c22169
format document
mweidling Feb 14, 2023
149a2eb
gpu mem instead of util
mweidling Feb 14, 2023
2999ef4
Update ocrd_eval.md
mweidling Feb 14, 2023
87f9438
GPU Peak Memory definition
mweidling Feb 14, 2023
5e80c94
Update ocrd_eval.md
mweidling Feb 14, 2023
5cd5efb
Update ocrd_eval.md
mweidling Feb 15, 2023
492b6ee
Update ocrd_eval.md
mweidling Feb 15, 2023
d8d4cef
Update ocrd_eval.md
mweidling Feb 15, 2023
48e69f8
add letter accuracy
mweidling Feb 15, 2023
3cc5bee
rephrase layout eval intro
mweidling Feb 15, 2023
f817521
add reading order evaluation
mweidling Feb 15, 2023
04c5c27
implement Uwe's feedback reg. Letter Accuracy
mweidling Feb 15, 2023
d078b1b
Apply suggestions from code review
mweidling Feb 16, 2023
43b364a
eval: Improvements to TeX formulas
kba Feb 28, 2023
79 changes: 52 additions & 27 deletions ocrd_eval.md
@@ -3,11 +3,11 @@
## Rationale

Evaluating the quality of OCR requires comparing the OCR results on representative **ground truth** (GT)
– i.e. realistic data (images) with manual transcriptions (segmentation, text).
OCR results can be obtained via several distinct **OCR workflows**.

The comparison requires evaluation tools which themselves build on a number of established
evaluation metrics.
The evaluation results must be presented in a way that allows factorising and localising aberrations,
both within documents (page types, individual pages, region types, individual regions) and across classes of similar documents.

@@ -123,7 +123,7 @@ Currently we only provide CER per page; higher-level CER results might be calcul

##### Word Error Rate (WER)

Word error rate (WER) is analogous to CER: While CER operates on (differences between) characters,
WER measures the percentage of incorrectly recognized words in a text.

A **word** in that context is usually defined as any sequence of characters between white space (including line breaks), with leading and trailing punctuation removed (according to [Unicode TR29 Word Boundary algorithm](http://unicode.org/reports/tr29/#Word_Boundaries)).
@@ -132,7 +132,7 @@ CER and WER share categories of errors, and the WER is similarly calculated:

$WER = \frac{i_w + s_w + d_w}{i_w + s_w + d_w + c_w}$

where $i_w$ is the number of inserted, $s_w$ the number of substituted, $d_w$ the number of deleted and $c_w$ the number of correct words.

More specific cases of WER consider only the "significant" words, omitting e.g. stopwords from the calculation.
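
For illustration (not prescribed by the spec), the four counts can be derived from a minimal-cost word alignment; a Python sketch:

```python
def wer(gt_words, ocr_words):
    """WER as defined above: (i+s+d) / (i+s+d+c), with counts taken
    from a minimal-cost (Levenshtein) alignment of word tokens."""
    n, m = len(gt_words), len(ocr_words)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt_words[i - 1] == ocr_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution/match
    # backtrace to count insertions, substitutions, deletions, correct words
    i, j = n, m
    ins = sub = dele = cor = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           dist[i][j] == dist[i - 1][j - 1] + (gt_words[i - 1] != ocr_words[j - 1]):
            cor += gt_words[i - 1] == ocr_words[j - 1]
            sub += gt_words[i - 1] != ocr_words[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dele, i = dele + 1, i - 1
    return (ins + sub + dele) / (ins + sub + dele + cor)

print(wer("dies ist ein Test".split(), "dies ist ein Tost".split()))  # 0.25
```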

@@ -152,15 +152,15 @@ Example:

> Eine Mondfinsternis ist die Himmelsbegebenheit welche sich zur Zeit des Vollmondes ereignet, wenn die Erde zwischen der Sonne und dem Monde steht, so daß die Strahlen der Sonne von der Erde aufgehalten werden, und daß man so den Schatten der Erde in dem Monde siehet. In diesem Jahre sind zwey Monfinsternisse, davon ist ebenfalls nur Eine bey uns sichtbar, und zwar am 30sten März des Morgens nach 4 Uhr, und währt bis nach 6 Uhr.

To get the Bag of Words of this paragraph, a multiset containing each word and its number of occurrences is created:

$BoW_{GT}$ =

```json
{
"Eine": 2, "Mondfinsternis": 1, "ist": 2, "die": 2, "Himmelsbegebenheit": 1,
"welche": 1, "sich": 1, "zur": 1, "Zeit": 1, "des": 2, "Vollmondes": 1,
"ereignet,": 1, "wenn":1, "Erde": 3, "zwischen": 1, "der": 4, "Sonne": 2,
"ereignet,": 1, "wenn": 1, "Erde": 3, "zwischen": 1, "der": 4, "Sonne": 2,
"und": 4, "dem": 2, "Monde": 2, "steht,": 1, "so": 2, "daß": 2,
"Strahlen": 1, "von": 1, "aufgehalten": 1, "werden,": 1, "man": 1, "den": 1,
"Schatten": 1, "in": 1, "siehet.": 1, "In": 1, "diesem": 1, "Jahre": 1,
@@ -171,6 +171,28 @@ $BoW$ =
}
```
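
For illustration (not prescribed by the spec), such a multiset can be built with Python's `collections.Counter`. Note that splitting on white space keeps punctuation attached to its word, matching entries like `"ereignet,"` above:

```python
from collections import Counter

gt_text = ("Eine Mondfinsternis ist die Himmelsbegebenheit welche sich "
           "zur Zeit des Vollmondes ereignet, wenn die Erde zwischen der "
           "Sonne und dem Monde steht, ...")  # GT paragraph from above, abridged

bow_gt = Counter(gt_text.split())
print(bow_gt["die"], bow_gt["ereignet,"])  # 2 1
```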

##### Bag of Words Metric

The Bag of Words Metric describes how many words in a recognized text correspond to words given in the Ground Truth, independent of a page's layout.

$BoW_m = \frac{|BoW_{GT}| - |\Delta_{GT/recognized}|}{|BoW_{GT}|}$

where $|BoW_{GT}|$ is the total number of words in the GT and $\Delta_{GT/recognized}$ is the multiset difference between the GT and the recognized Bag of Words.

###### Example

Given

$BoW_{GT} = \{"Eine": 1, "Mondfinsternis": 1, "steht": 1, "bevor": 1\}$

and

$BoW_{recognized} = \{"Eine": 1, "Mondfinsternis": 1, "fteht": 1, "bevor": 1\}$

results in:

$BoW_m = \frac{4 - 1}{4} = 0.75$

In this example 75% of the words have been correctly recognized.
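
A minimal sketch of the metric (the helper name `bow_metric` is ours, not a prescribed tool), using `Counter` subtraction for the multiset difference $\Delta_{GT/recognized}$:

```python
from collections import Counter

def bow_metric(gt_words, rec_words):
    gt, rec = Counter(gt_words), Counter(rec_words)
    delta = gt - rec             # GT words (with multiplicity) not matched
    n_gt = sum(gt.values())      # total number of GT words
    return (n_gt - sum(delta.values())) / n_gt

gt = ["Eine", "Mondfinsternis", "steht", "bevor"]
rec = ["Eine", "Mondfinsternis", "fteht", "bevor"]
print(bow_metric(gt, rec))  # 0.75
```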

### Layout Evaluation

For documents with a complex structure, looking at the recognized text's accuracy alone is often insufficient to accurately determine the quality of OCR. An example can help to illustrate this: in a document containing two columns, all characters and words may be recognized correctly, but if layout analysis detects the two columns as just one, the OCR result will contain the first lines of the first and second column, followed by the second lines of the first and second column, and so forth. This renders the sequence of words and paragraphs in the Ground Truth text wrongly and defeats almost all downstream processes.
@@ -257,13 +279,13 @@ The following metrics are not part of the MVP (minimal viable product) and will

#### GPU Metrics

##### GPU Avg Memory

GPU avg memory refers to the average amount of memory of the GPU (in GiB) that was used during processing.

##### GPU Peak Memory

GPU peak memory is the maximum GPU memory allocated during the execution of a workflow in MB.
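
As an illustration of how such values might be sampled (assuming an NVIDIA GPU and the `pynvml` bindings; this is one possible approach, not a prescribed tool):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

samples = []
for _ in range(10):  # in practice: poll for the whole workflow duration
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    samples.append(info.used)  # used GPU memory in bytes
    time.sleep(1)

avg_gib = sum(samples) / len(samples) / 2**30   # GPU avg memory (GiB)
peak_mb = max(samples) / 10**6                  # GPU peak memory (MB)
pynvml.nvmlShutdown()
print(avg_gib, peak_mb)
```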

#### Text Evaluation

@@ -273,12 +295,12 @@ TODO

##### Flexible Character Accuracy Measure

The Flexible Character Accuracy (FCA) measure has been introduced to mitigate a major drawback of CER:
CER (if applied naively by comparing concatenated page-level texts) is heavily dependent on the reading order an OCR engine detects.
Thus, where text blocks are rearranged or merged, no suitable text alignment can be made, so CER is very high,
even if single characters, words and even lines have been perfectly recognized.

FCA avoids this by splitting the recognized text and GT into lines and, if necessary, sub-line chunks,
finding pairs that align maximally until only unmatched lines remain (which must be treated as errors),
and measuring average CER of all pairs.
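
A greatly simplified greedy sketch of this idea (the full algorithm also splits lines into sub-line chunks and ranks matches with penalties; `difflib`'s similarity ratio serves as a stand-in for an edit-distance CER):

```python
from difflib import SequenceMatcher

def cer(gt, ocr):
    """Approximate CER of two strings (0 = identical)."""
    if not gt and not ocr:
        return 0.0
    return 1.0 - SequenceMatcher(None, gt, ocr).ratio()

def flexible_char_accuracy(gt_lines, ocr_lines):
    gt, ocr = list(gt_lines), list(ocr_lines)
    cers = []
    while gt and ocr:
        # pick the globally best-matching (lowest-CER) line pair
        i, j = min(((i, j) for i in range(len(gt)) for j in range(len(ocr))),
                   key=lambda p: cer(gt[p[0]], ocr[p[1]]))
        cers.append(cer(gt.pop(i), ocr.pop(j)))
    cers += [1.0] * (len(gt) + len(ocr))  # unmatched lines count as errors
    return 1.0 - sum(cers) / len(cers)

# reading order swapped, yet accuracy is still 1.0:
print(flexible_char_accuracy(["Spalte A, Zeile 1", "Spalte B, Zeile 1"],
                             ["Spalte B, Zeile 1", "Spalte A, Zeile 1"]))
```
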

Expand Down Expand Up @@ -311,9 +333,9 @@ The following paragraphs will first introduce the intermediate concepts needed t

###### Precision and Recall

**Precision** describes to which degree the predictions of a model are correct.
The higher the precision of a model, the more confidently we can assume that each prediction is correct
(e.g. when the model has identified a bicycle in an image, the image actually depicts a bicycle).
A precision of 1 (or 100%) indicates all predictions are correct (true positives) and no predictions are incorrect (false positives). The lower the precision value, the more false positives.

In the context of object detection in images, it measures either
@@ -329,16 +351,18 @@ The higher the recall of a model, the more confidently we can assume that it cov
A recall of 1 (or 100%) indicates that all objects have a correct prediction (true positives) and no predictions are missing or mislabelled (false negatives). The lower the recall value, the more false negatives.

In the context of object detection in images, it measures either

* the ratio of correctly detected segments over all actual segments, or
* the ratio of correctly segmented pixels over the image size.

Notice that the two goals naturally conflict with each other. A good predictor needs both high precision and recall,
but the optimal trade-off depends on the application.
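
Expressed in raw detection counts (a trivial sketch; `tp`, `fp`, `fn` denote true positives, false positives and false negatives):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)  # share of predictions that are correct

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)  # share of actual objects that were found

print(precision(tp=8, fp=2), recall(tp=8, fn=4))  # 0.8 0.666...
```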

For layout analysis though, the underlying notion of sufficient overlap itself is inadequate:

* it does not discern oversegmentation from undersegmentation
* it does not discern splits/merges that are allowable (irrelevant w.r.t. text flow) or not (break up or conflate lines)
* it does not discern foreground from background, or when partial overlap starts breaking character legibility or introducing ghost characters

###### Prediction Score

@@ -352,7 +376,7 @@ Whether this prediction is then considered to be a positive detection, depends o

For object detection, the metrics precision and recall are usually defined in terms of a threshold for the degree of overlap
(represented by the IoU as defined [above](#iou-intersection-over-union), ranging between 0 and 1)
above which pairs of detected and GT segments are qualified as matches.

(Predictions that are non-matches across all GT objects – false positives – and GT objects that are non-matches across all predictions – false negatives – contribute indirectly in the denominator.)
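
For axis-aligned bounding boxes, IoU and the threshold test can be sketched as follows (segments may in practice be polygons or masks, but the principle is the same; the 0.5 threshold is merely a common example value):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

IOU_THRESHOLD = 0.5  # example value
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))                   # 0.333...
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= IOU_THRESHOLD)  # False: no match
```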

@@ -364,7 +388,7 @@ Therefore, the union of that pair is more than double the intersection. But sinc

###### Precision-Recall Curve

By varying the prediction threshold (and/or the IoU threshold), the tradeoff between precision and recall can be tuned.
When the full range of combinations has been gauged, the result can be visualised in a precision-recall curve (or a receiver operating characteristic, ROC).
Usually the optimum balance is where the product of precision and recall (i.e. area under the curve) is maximal.

@@ -388,8 +412,8 @@ This graph is called Precision-Recall-Curve.

###### Average Precision

Average Precision (AP) describes how well (i.e. how flexibly and robustly) a model can detect objects in an image,
by averaging precision over the full range (from 0 to 1) of confidence thresholds (and thus, recall results).
It is equal to the area under the Precision-Recall Curve.

![A sample precision/recall curve with highlighted area under curve](https://pad.gwdg.de/uploads/799e6a05-e64a-4956-9ede-440ac0463a3f.png)
Expand All @@ -412,7 +436,7 @@ AP & = \displaystyle\sum_{k=0}^{k=n-1}[r(k) - r(k+1)] * p(k) \\
\end{array}
$$
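
A sketch of this discrete summation, assuming the measured (recall, precision) points are ordered by decreasing recall and $r(n) = 0$:

```python
def average_precision(recalls, precisions):
    """AP = sum over k of [r(k) - r(k+1)] * p(k)."""
    ap = 0.0
    for k in range(len(recalls)):
        r_next = recalls[k + 1] if k + 1 < len(recalls) else 0.0
        ap += (recalls[k] - r_next) * precisions[k]
    return ap

# three measured thresholds: 0.2*0.5 + 0.4*0.7 + 0.4*0.9 = 0.74
print(average_precision([1.0, 0.8, 0.4], [0.5, 0.7, 0.9]))
```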

Usually, AP calculation also involves *smoothing* (i.e. clipping local minima) and *interpolation* (i.e. adding data points between the measured confidence thresholds).

###### Mean Average Precision

Expand All @@ -425,14 +449,15 @@ $mAP = \displaystyle\frac{1}{N}\sum_{i=1}^{N}AP_i$ with $N$ being the number of

Often, this mAP for a range of IoU thresholds gets complemented by additional mAP runs for a set of fixed values, or for various classes and object sizes only.
The common understanding is that those different measures collectively allow drawing better conclusions and comparisons about the model's quality.

##### Scenario-Driven Performance Evaluation

Scenario-driven, layout-dedicated, text-flow informed performance evaluation as described in
[Clausner et al., 2011](https://primaresearch.org/publications/ICDAR2011_Clausner_PerformanceEvaluation)
is currently the most comprehensive and sophisticated approach to evaluate the quality of layout analysis.

It is not a single metric, but comprises a multitude of measures derived in a unified method, which considers
the crucial effects that segmentation can have on text flow, i.e. which kinds of overlaps (merges and splits)
amount to benign deviations (extra white-space) or pathological ones (breaking lines and words apart).
In this approach, all the derived measures are aggregated under various sets of weights, called evaluation scenarios,
which target specific use cases (like headline or keyword extraction, linear fulltext, newspaper or figure extraction).