
QA Spec #225

Merged
merged 57 commits, Mar 1, 2023
Changes from 6 commits

Commits (57)
e2d4038
wip
kba Sep 6, 2022
0cc4a0b
Merge remote-tracking branch 'origin/master' into qa-spec
kba Sep 14, 2022
5aa6bd5
rewrite eval schema and saple according to OCR-D/zenhub#123
kba Sep 21, 2022
6cd0caf
add metrics to ocrd_eval.md
kba Sep 21, 2022
b529531
ocrd_eval: \begin{array}{ll} instead of .. {2}
kba Sep 22, 2022
18333b8
style(ocrd_eval.md): linting, formatting and correcting images
mweidling Sep 26, 2022
fe9d6ff
stlye: add new line
mweidling Sep 26, 2022
d7854a1
Apply suggestions from code review
kba Sep 27, 2022
ee67881
Apply suggestions from code review
kba Sep 27, 2022
5b35358
retcon JSON changes to YAML
mweidling Sep 27, 2022
1aa048c
comment EvaluationMetrics back in
kba Sep 27, 2022
5840476
generate minimal JSON from YAML src
kba Sep 27, 2022
c9d313f
comment out undiscussed CER metrics
mweidling Sep 27, 2022
a814c89
feat: move workflow_steps to ocr_workflow object
mweidling Nov 24, 2022
a881e08
remove schema from this branch, cf. #236
kba Dec 19, 2022
c7ae88d
integrate Uwe's feedback
mweidling Jan 23, 2023
7ce6c1a
Update ocrd_eval.md
mweidling Jan 23, 2023
ef8aeea
Update ocrd_eval.md
mweidling Jan 23, 2023
e2d2ec9
Update ocrd_eval.md
mweidling Feb 3, 2023
ad97594
Update ocrd_eval.md
mweidling Feb 3, 2023
34a78cb
Update ocrd_eval.md
mweidling Feb 3, 2023
c95ce0b
Update ocrd_eval.md
mweidling Feb 3, 2023
13b2bcd
Update ocrd_eval.md
mweidling Feb 3, 2023
b823afc
Update ocrd_eval.md
mweidling Feb 3, 2023
8ab1391
Apply suggestions from code review
mweidling Feb 3, 2023
5e1da31
Apply suggestions from code review
mweidling Feb 3, 2023
0183ea9
Update ocrd_eval.md
mweidling Feb 3, 2023
5519120
update character definition wrt. white spaces
mweidling Feb 7, 2023
dd2d63b
refine paragraph about characters
mweidling Feb 9, 2023
0253720
move character section before edit distance section
mweidling Feb 9, 2023
19deddd
add placeholder for letter accuracy
mweidling Feb 9, 2023
149b271
fix link
mweidling Feb 9, 2023
7d0bbf6
Update ocrd_eval.md
mweidling Feb 9, 2023
851aeb7
Update ocrd_eval.md
mweidling Feb 9, 2023
c81079a
Update ocrd_eval.md
mweidling Feb 10, 2023
fab6202
Update ocrd_eval.md
mweidling Feb 10, 2023
f678050
implement feedback
mweidling Feb 10, 2023
a680c70
be more precise about CER/WER granularity
mweidling Feb 10, 2023
5e94aa0
change GPU metrics
mweidling Feb 10, 2023
e8dc864
change citation hint
mweidling Feb 10, 2023
ee330c9
adjust WER definition
mweidling Feb 10, 2023
9ea4b62
Apply suggestions from code review
mweidling Feb 10, 2023
c910f0e
add bow metric
mweidling Feb 14, 2023
8c22169
format document
mweidling Feb 14, 2023
149a2eb
gpu mem instead of util
mweidling Feb 14, 2023
2999ef4
Update ocrd_eval.md
mweidling Feb 14, 2023
87f9438
GPU Peak Memory definition
mweidling Feb 14, 2023
5e80c94
Update ocrd_eval.md
mweidling Feb 14, 2023
5cd5efb
Update ocrd_eval.md
mweidling Feb 15, 2023
492b6ee
Update ocrd_eval.md
mweidling Feb 15, 2023
d8d4cef
Update ocrd_eval.md
mweidling Feb 15, 2023
48e69f8
add letter accuracy
mweidling Feb 15, 2023
3cc5bee
rephrase layout eval intro
mweidling Feb 15, 2023
f817521
add reading order evaluation
mweidling Feb 15, 2023
04c5c27
implement Uwe's feedback reg. Letter Accuracy
mweidling Feb 15, 2023
d078b1b
Apply suggestions from code review
mweidling Feb 16, 2023
43b364a
eval: Improvements to TeX formulas
kba Feb 28, 2023
79 changes: 52 additions & 27 deletions ocrd_eval.md
@@ -3,11 +3,11 @@
## Rationale

Evaluating the quality of OCR requires comparing the OCR results on representative **ground truth** (GT)
– i.e. realistic data (images) with manual transcriptions (segmentation, text).
OCR results can be obtained via several distinct **OCR workflows**.

The comparison requires evaluation tools which themselves build on a number of established
evaluation metrics.
The evaluation results must be presented in a way that allows factorising and localising aberrations,
both within documents (page types, individual pages, region types, individual regions) and across classes of similar documents.

@@ -123,7 +123,7 @@ Currently we only provide CER per page; higher-level CER results might be calcul

##### Word Error Rate (WER)

Word error rate (WER) is analogous to CER: While CER operates on (differences between) characters,
WER measures the percentage of incorrectly recognized words in a text.

A **word** in that context is usually defined as any sequence of characters between white space (including line breaks), with leading and trailing punctuation removed (according to [Unicode TR29 Word Boundary algorithm](http://unicode.org/reports/tr29/#Word_Boundaries)).
@@ -132,7 +132,7 @@ CER and WER share categories of errors, and the WER is similarly calculated:

$WER = \frac{i_w + s_w + d_w}{i_w + s_w + d_w + c_w}$

where $i_w$ is the number of inserted, $s_w$ the number of substituted, $d_w$ the number of deleted and $c_w$ the number of correct words.

More specific cases of WER consider only the "significant" words, omitting e.g. stopwords from the calculation.
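
For illustration (not prescribed by the spec), the four counts can be derived from a minimal-cost word alignment; a Python sketch:

```python
def wer(gt_words, ocr_words):
    """WER as defined above: (i+s+d) / (i+s+d+c), with counts taken
    from a minimal-cost (Levenshtein) alignment of word tokens."""
    n, m = len(gt_words), len(ocr_words)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt_words[i - 1] == ocr_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution/match
    # backtrace to count insertions, substitutions, deletions, correct words
    i, j = n, m
    ins = sub = dele = cor = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           dist[i][j] == dist[i - 1][j - 1] + (gt_words[i - 1] != ocr_words[j - 1]):
            cor += gt_words[i - 1] == ocr_words[j - 1]
            sub += gt_words[i - 1] != ocr_words[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dele, i = dele + 1, i - 1
    return (ins + sub + dele) / (ins + sub + dele + cor)

print(wer("dies ist ein Test".split(), "dies ist ein Tost".split()))  # 0.25
```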

@@ -152,15 +152,15 @@ Example:

> Eine Mondfinsternis ist die Himmelsbegebenheit welche sich zur Zeit des Vollmondes ereignet, wenn die Erde zwischen der Sonne und dem Monde steht, so daß die Strahlen der Sonne von der Erde aufgehalten werden, und daß man so den Schatten der Erde in dem Monde siehet. In diesem Jahre sind zwey Monfinsternisse, davon ist ebenfalls nur Eine bey uns sichtbar, und zwar am 30sten März des Morgens nach 4 Uhr, und währt bis nach 6 Uhr.

To get the Bag of Words of this paragraph, a multiset containing each word and its number of occurrences is created:

$BoW_{GT}$ =

```json
{
"Eine": 2, "Mondfinsternis": 1, "ist": 2, "die": 2, "Himmelsbegebenheit": 1,
"welche": 1, "sich": 1, "zur": 1, "Zeit": 1, "des": 2, "Vollmondes": 1,
"ereignet,": 1, "wenn":1, "Erde": 3, "zwischen": 1, "der": 4, "Sonne": 2,
"ereignet,": 1, "wenn": 1, "Erde": 3, "zwischen": 1, "der": 4, "Sonne": 2,
"und": 4, "dem": 2, "Monde": 2, "steht,": 1, "so": 2, "daß": 2,
"Strahlen": 1, "von": 1, "aufgehalten": 1, "werden,": 1, "man": 1, "den": 1,
"Schatten": 1, "in": 1, "siehet.": 1, "In": 1, "diesem": 1, "Jahre": 1,
@@ -171,6 +171,28 @@ $BoW$ =
}
```
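
For illustration (not prescribed by the spec), such a multiset can be built with Python's `collections.Counter`. Note that splitting on white space keeps punctuation attached to its word, matching entries like `"ereignet,"` above:

```python
from collections import Counter

gt_text = ("Eine Mondfinsternis ist die Himmelsbegebenheit welche sich "
           "zur Zeit des Vollmondes ereignet, wenn die Erde zwischen der "
           "Sonne und dem Monde steht, ...")  # GT paragraph from above, abridged

bow_gt = Counter(gt_text.split())
print(bow_gt["die"], bow_gt["ereignet,"])  # 2 1
```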

##### Bag of Words Metric

The Bag of Words Metric describes how many words in a recognized text correspond to words given in the Ground Truth, independent of a page's layout.

$BoW_m = \frac{|BoW_{GT}| - |\Delta_{GT/recognized}|}{|BoW_{GT}|}$

where $|BoW_{GT}|$ is the total number of words in the GT and $\Delta_{GT/recognized}$ is the multiset difference between the GT and the recognized Bag of Words.

###### Example

Given

$BoW_{GT} = \{"Eine": 1, "Mondfinsternis": 1, "steht": 1, "bevor": 1\}$

and

$BoW_{recognized} = \{"Eine": 1, "Mondfinsternis": 1, "fteht": 1, "bevor": 1\}$

results in:

$BoW_m = \frac{4 - 1}{4} = 0.75$

In this example 75% of the words have been correctly recognized.
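
A minimal sketch of the metric (the helper name `bow_metric` is ours, not a prescribed tool), using `Counter` subtraction for the multiset difference $\Delta_{GT/recognized}$:

```python
from collections import Counter

def bow_metric(gt_words, rec_words):
    gt, rec = Counter(gt_words), Counter(rec_words)
    delta = gt - rec             # GT words (with multiplicity) not matched
    n_gt = sum(gt.values())      # total number of GT words
    return (n_gt - sum(delta.values())) / n_gt

gt = ["Eine", "Mondfinsternis", "steht", "bevor"]
rec = ["Eine", "Mondfinsternis", "fteht", "bevor"]
print(bow_metric(gt, rec))  # 0.75
```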

### Layout Evaluation

For documents with a complex structure, looking at the recognized text's accuracy alone is often insufficient to accurately determine the quality of OCR. An example can help to illustrate this: in a document containing two columns, all characters and words may be recognized correctly, but if layout analysis detects the two columns as just one, the OCR result will contain the first lines of the first and second column, followed by the second lines of the first and second column, and so forth. This renders the sequence of words and paragraphs in the Ground Truth text wrongly and defeats almost all downstream processes.
@@ -257,13 +279,13 @@ The following metrics are not part of the MVP (minimal viable product) and will

#### GPU Metrics

##### GPU Avg Memory

GPU avg memory refers to the average amount of memory of the GPU (in GiB) that was used during processing.

##### GPU Peak Memory

GPU peak memory is the maximum GPU memory allocated during the execution of a workflow in MB.
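
As an illustration of how such values might be sampled (assuming an NVIDIA GPU and the `pynvml` bindings; this is one possible approach, not a prescribed tool):

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

samples = []
for _ in range(10):  # in practice: poll for the whole workflow duration
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    samples.append(info.used)  # used GPU memory in bytes
    time.sleep(1)

avg_gib = sum(samples) / len(samples) / 2**30   # GPU avg memory (GiB)
peak_mb = max(samples) / 10**6                  # GPU peak memory (MB)
pynvml.nvmlShutdown()
print(avg_gib, peak_mb)
```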

#### Text Evaluation

@@ -273,12 +295,12 @@ TODO

##### Flexible Character Accuracy Measure

The Flexible Character Accuracy (FCA) measure has been introduced to mitigate a major drawback of CER:
CER (if applied naively by comparing concatenated page-level texts) is heavily dependent on the reading order an OCR engine detects.
Thus, where text blocks are rearranged or merged, no suitable text alignment can be made, so CER is very high,
even if single characters, words and even lines have been perfectly recognized.

FCA avoids this by splitting the recognized text and GT into lines and, if necessary, sub-line chunks,
finding pairs that align maximally until only unmatched lines remain (which must be treated as errors),
and measuring average CER of all pairs.
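
A greatly simplified greedy sketch of this idea (the full algorithm also splits lines into sub-line chunks and ranks matches with penalties; `difflib`'s similarity ratio serves as a stand-in for an edit-distance CER):

```python
from difflib import SequenceMatcher

def cer(gt, ocr):
    """Approximate CER of two strings (0 = identical)."""
    if not gt and not ocr:
        return 0.0
    return 1.0 - SequenceMatcher(None, gt, ocr).ratio()

def flexible_char_accuracy(gt_lines, ocr_lines):
    gt, ocr = list(gt_lines), list(ocr_lines)
    cers = []
    while gt and ocr:
        # pick the globally best-matching (lowest-CER) line pair
        i, j = min(((i, j) for i in range(len(gt)) for j in range(len(ocr))),
                   key=lambda p: cer(gt[p[0]], ocr[p[1]]))
        cers.append(cer(gt.pop(i), ocr.pop(j)))
    cers += [1.0] * (len(gt) + len(ocr))  # unmatched lines count as errors
    return 1.0 - sum(cers) / len(cers)

# reading order swapped, yet accuracy is still 1.0:
print(flexible_char_accuracy(["Spalte A, Zeile 1", "Spalte B, Zeile 1"],
                             ["Spalte B, Zeile 1", "Spalte A, Zeile 1"]))
```
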

Expand Down Expand Up @@ -311,9 +333,9 @@ The following paragraphs will first introduce the intermediate concepts needed t

###### Precision and Recall

**Precision** describes to which degree the predictions of a model are correct.
The higher the precision of a model, the more confidently we can assume that each prediction is correct
(e.g. when the model has identified a bicycle in an image, the image actually depicts a bicycle).
A precision of 1 (or 100%) indicates all predictions are correct (true positives) and no predictions are incorrect (false positives). The lower the precision value, the more false positives.

In the context of object detection in images, it measures either
@@ -329,16 +351,18 @@ The higher the recall of a model, the more confidently we can assume that it cov
A recall of 1 (or 100%) indicates that all objects have a correct prediction (true positives) and no predictions are missing or mislabelled (false negatives). The lower the recall value, the more false negatives.

In the context of object detection in images, it measures either

* the ratio of correctly detected segments over all actual segments, or
* the ratio of correctly segmented pixels over the image size.

Notice that the two goals naturally conflict with each other. A good predictor needs both high precision and recall,
but the optimal trade-off depends on the application.
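
Expressed in raw detection counts (a trivial sketch; `tp`, `fp`, `fn` denote true positives, false positives and false negatives):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)  # share of predictions that are correct

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)  # share of actual objects that were found

print(precision(tp=8, fp=2), recall(tp=8, fn=4))  # 0.8 0.666...
```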

For layout analysis though, the underlying notion of sufficient overlap itself is inadequate:

* it does not discern oversegmentation from undersegmentation
* it does not discern splits/merges that are allowable (irrelevant w.r.t. text flow) or not (break up or conflate lines)
* it does not discern foreground from background, or when partial overlap starts breaking character legibility or introducing ghost characters

###### Prediction Score

@@ -352,7 +376,7 @@ Whether this prediction is then considered to be a positive detection, depends o

For object detection, the metrics precision and recall are usually defined in terms of a threshold for the degree of overlap
(represented by the IoU as defined [above](#iou-intersection-over-union), ranging between 0 and 1)
above which pairs of detected and GT segments are qualified as matches.

(Predictions that are non-matches across all GT objects – false positives – and GT objects that are non-matches across all predictions – false negatives – contribute indirectly in the denominator.)
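
For axis-aligned bounding boxes, IoU and the threshold test can be sketched as follows (segments may in practice be polygons or masks, but the principle is the same; the 0.5 threshold is merely a common example value):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

IOU_THRESHOLD = 0.5  # example value
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))                   # 0.333...
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= IOU_THRESHOLD)  # False: no match
```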

@@ -364,7 +388,7 @@ Therefore, the union of that pair is more than double the intersection. But sinc

###### Precision-Recall Curve

By varying the prediction threshold (and/or the IoU threshold), the tradeoff between precision and recall can be tuned.
When the full range of combinations has been gauged, the result can be visualised in a precision-recall curve (or a receiver operating characteristic, ROC).
Usually the optimum balance is where the product of precision and recall (i.e. area under the curve) is maximal.

@@ -388,8 +412,8 @@ This graph is called Precision-Recall-Curve.

###### Average Precision

Average Precision (AP) describes how well (i.e. how flexibly and robustly) a model can detect objects in an image,
by averaging precision over the full range (from 0 to 1) of confidence thresholds (and thus, recall results).
It is equal to the area under the Precision-Recall Curve.

![A sample precision/recall curve with highlighted area under curve](https://pad.gwdg.de/uploads/799e6a05-e64a-4956-9ede-440ac0463a3f.png)
Expand All @@ -412,7 +436,7 @@ AP & = \displaystyle\sum_{k=0}^{k=n-1}[r(k) - r(k+1)] * p(k) \\
\end{array}
$$
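
A sketch of this discrete summation, assuming the measured (recall, precision) points are ordered by decreasing recall and $r(n) = 0$:

```python
def average_precision(recalls, precisions):
    """AP = sum over k of [r(k) - r(k+1)] * p(k)."""
    ap = 0.0
    for k in range(len(recalls)):
        r_next = recalls[k + 1] if k + 1 < len(recalls) else 0.0
        ap += (recalls[k] - r_next) * precisions[k]
    return ap

# three measured thresholds: 0.2*0.5 + 0.4*0.7 + 0.4*0.9 = 0.74
print(average_precision([1.0, 0.8, 0.4], [0.5, 0.7, 0.9]))
```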

Usually, AP calculation also involves *smoothing* (i.e. clipping local minima) and *interpolation* (i.e. adding data points between the measured confidence thresholds).

###### Mean Average Precision

Expand All @@ -425,14 +449,15 @@ $mAP = \displaystyle\frac{1}{N}\sum_{i=1}^{N}AP_i$ with $N$ being the number of

Often, this mAP for a range of IoU thresholds gets complemented by additional mAP runs for a set of fixed values, or for various classes and object sizes only.
The common understanding is that those different measures collectively allow drawing better conclusions and comparisons about the model's quality.

##### Scenario-Driven Performance Evaluation

Scenario-driven, layout-dedicated, text-flow informed performance evaluation as described in
[Clausner et al., 2011](https://primaresearch.org/publications/ICDAR2011_Clausner_PerformanceEvaluation)
is currently the most comprehensive and sophisticated approach to evaluate the quality of layout analysis.

It is not a single metric, but comprises a multitude of measures derived in a unified method, which considers
the crucial effects that segmentation can have on text flow, i.e. which kinds of overlaps (merges and splits)
amount to benign deviations (extra white-space) or pathological ones (breaking lines and words apart).
In this approach, all the derived measures are aggregated under various sets of weights, called evaluation scenarios,
which target specific use cases (like headline or keyword extraction, linear fulltext, newspaper or figure extraction).