QA Spec #225

Merged
merged 57 commits into from
Mar 1, 2023
Changes from 8 commits
Commits
e2d4038
wip
kba Sep 6, 2022
0cc4a0b
Merge remote-tracking branch 'origin/master' into qa-spec
kba Sep 14, 2022
5aa6bd5
rewrite eval schema and saple according to OCR-D/zenhub#123
kba Sep 21, 2022
6cd0caf
add metrics to ocrd_eval.md
kba Sep 21, 2022
b529531
ocrd_eval: \begin{array}{ll} instead of .. {2}
kba Sep 22, 2022
18333b8
style(ocrd_eval.md): linting, formatting and correcting images
mweidling Sep 26, 2022
fe9d6ff
stlye: add new line
mweidling Sep 26, 2022
d7854a1
Apply suggestions from code review
kba Sep 27, 2022
ee67881
Apply suggestions from code review
kba Sep 27, 2022
5b35358
retcon JSON changes to YAML
mweidling Sep 27, 2022
1aa048c
comment EvaluationMetrics back in
kba Sep 27, 2022
5840476
generate minimal JSON from YAML src
kba Sep 27, 2022
c9d313f
comment out undiscussed CER metrics
mweidling Sep 27, 2022
a814c89
feat: move workflow_steps to ocr_workflow object
mweidling Nov 24, 2022
a881e08
remove schema from this branch, cf. #236
kba Dec 19, 2022
c7ae88d
integrate Uwe's feedback
mweidling Jan 23, 2023
7ce6c1a
Update ocrd_eval.md
mweidling Jan 23, 2023
ef8aeea
Update ocrd_eval.md
mweidling Jan 23, 2023
e2d2ec9
Update ocrd_eval.md
mweidling Feb 3, 2023
ad97594
Update ocrd_eval.md
mweidling Feb 3, 2023
34a78cb
Update ocrd_eval.md
mweidling Feb 3, 2023
c95ce0b
Update ocrd_eval.md
mweidling Feb 3, 2023
13b2bcd
Update ocrd_eval.md
mweidling Feb 3, 2023
b823afc
Update ocrd_eval.md
mweidling Feb 3, 2023
8ab1391
Apply suggestions from code review
mweidling Feb 3, 2023
5e1da31
Apply suggestions from code review
mweidling Feb 3, 2023
0183ea9
Update ocrd_eval.md
mweidling Feb 3, 2023
5519120
update character definition wrt. white spaces
mweidling Feb 7, 2023
dd2d63b
refine paragraph about characters
mweidling Feb 9, 2023
0253720
move character section before edit distance section
mweidling Feb 9, 2023
19deddd
add placeholder for letter accuracy
mweidling Feb 9, 2023
149b271
fix link
mweidling Feb 9, 2023
7d0bbf6
Update ocrd_eval.md
mweidling Feb 9, 2023
851aeb7
Update ocrd_eval.md
mweidling Feb 9, 2023
c81079a
Update ocrd_eval.md
mweidling Feb 10, 2023
fab6202
Update ocrd_eval.md
mweidling Feb 10, 2023
f678050
implement feedback
mweidling Feb 10, 2023
a680c70
be more precise about CER/WER granularity
mweidling Feb 10, 2023
5e94aa0
change GPU metrics
mweidling Feb 10, 2023
e8dc864
change citation hint
mweidling Feb 10, 2023
ee330c9
adjust WER definition
mweidling Feb 10, 2023
9ea4b62
Apply suggestions from code review
mweidling Feb 10, 2023
c910f0e
add bow metric
mweidling Feb 14, 2023
8c22169
format document
mweidling Feb 14, 2023
149a2eb
gpu mem instead of util
mweidling Feb 14, 2023
2999ef4
Update ocrd_eval.md
mweidling Feb 14, 2023
87f9438
GPU Peak Memory definition
mweidling Feb 14, 2023
5e80c94
Update ocrd_eval.md
mweidling Feb 14, 2023
5cd5efb
Update ocrd_eval.md
mweidling Feb 15, 2023
492b6ee
Update ocrd_eval.md
mweidling Feb 15, 2023
d8d4cef
Update ocrd_eval.md
mweidling Feb 15, 2023
48e69f8
add letter accuracy
mweidling Feb 15, 2023
3cc5bee
rephrase layout eval intro
mweidling Feb 15, 2023
f817521
add reading order evaluation
mweidling Feb 15, 2023
04c5c27
implement Uwe's feedback reg. Letter Accuracy
mweidling Feb 15, 2023
d078b1b
Apply suggestions from code review
mweidling Feb 16, 2023
43b364a
eval: Improvements to TeX formulas
kba Feb 28, 2023
90 changes: 63 additions & 27 deletions ocrd_eval.md
@@ -27,7 +27,10 @@ At this stage (Q3 2022) these definitions serve as a basis of common understanding

### Text Evaluation

The most important measure to assess the quality of OCR is the accuracy of the recognized text. The majority of metrics for this are based on the Levenshtein distance, an algorithm to compute the distance between two strings. In OCR, one of these strings is generally the Ground Truth text and the other the recognized text which is the result of an OCR.
The most important measure to assess the quality of OCR is the accuracy of the recognized text.
The majority of metrics for this are based on the Levenshtein distance, an algorithm to compute the distance between two strings.
In OCR, one of these strings is generally the Ground Truth text and the other the recognized text which is the result of an OCR.
The text is concatenated at page level from smaller constituents in reading order.

#### Characters

@@ -110,11 +113,11 @@ The *normalized* CER avoids this effect by considering the number of correct characters

$CER_n = \frac{i + s+ d}{i + s + d + c}$

In OCR-D's benchmarking we calculate the *non-normalized* CER where values over 1 should be read as 100%.
In OCR-D's benchmarking we calculate the *normalized* CER where values naturally range between 0 and 100%.
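
As an illustration of this formula (a minimal sketch, not the evaluation tooling OCR-D actually uses), the counts $i$, $s$, $d$ and $c$ can be obtained from a standard Levenshtein alignment; the helper works on any sequence, so it is reused for words below:

```python
def levenshtein_ops(gt, ocr):
    """Minimal-edit (Levenshtein) alignment of GT vs. OCR; returns the number
    of insertions, substitutions, deletions and correct items."""
    n, m = len(gt), len(ocr)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion (GT item missing)
                             dist[i][j - 1] + 1,          # insertion (spurious OCR item)
                             dist[i - 1][j - 1] + cost)   # match or substitution
    # backtrack through the distance matrix to classify the operations
    ins = sub = dele = corr = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (gt[i - 1] != ocr[j - 1]):
            if gt[i - 1] == ocr[j - 1]:
                corr += 1
            else:
                sub += 1
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dele += 1
            i -= 1
    return ins, sub, dele, corr

def cer_normalized(gt_text, ocr_text):
    i, s, d, c = levenshtein_ops(gt_text, ocr_text)
    return (i + s + d) / (i + s + d + c) if (i + s + d + c) else 0.0
```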

###### CER Granularity

In OCR-D we distinguish between the CER per **page** and the **overall** CER of a text. The reasoning behind this is that the material OCR-D mainly aims at (historical prints) is very heterogeneous: Some pages might have an almost simplistic layout while others can be highly complex and difficult to process. Providing only an overall CER would cloud these differences between pages.
In OCR-D we distinguish between the CER per **page** and the **overall** CER of a document. The reasoning behind this is that the material OCR-D mainly aims at (historical prints) is very heterogeneous: Some pages might have an almost simplistic layout while others can be highly complex and difficult to process. Providing only an overall CER would cloud these differences between pages.

Currently we only provide CER per page; higher-level CER results might be calculated as a weighted aggregate at a later stage.

@@ -127,21 +130,21 @@ A **word** in that context is usually defined as any sequence of characters between

CER and WER share categories of errors, and the WER is similarly calculated:

$WER = \frac{i_w + s_w + d_w}{n_w}$
$WER = \frac{i_w + s_w + d_w}{i_w + s_w + d_w + c_w}$

where $i_w$ is the number of inserted, $s_w$ the number of substituted, $d_w$ the number of deleted and $n_w$ the total number of words.
where $i_w$ is the number of inserted, $s_w$ the number of substituted, $d_w$ the number of deleted and $c_w$ the number of correct words.

More specific cases of WER consider only the "significant" words, omitting e.g. stopwords from the calculation.
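
On the word level, the same alignment can be applied to token sequences; a minimal sketch, assuming simple whitespace tokenization and reusing the `levenshtein_ops` helper from the CER sketch above:

```python
def wer(gt_text, ocr_text):
    """Word error rate over whitespace-delimited tokens, using the same
    insertion/substitution/deletion/correct counts as the CER sketch."""
    i, s, d, c = levenshtein_ops(gt_text.split(), ocr_text.split())
    return (i + s + d) / (i + s + d + c) if (i + s + d + c) else 0.0
```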

###### WER Granularity

In OCR-D we distinguish between the WER per **page** and the **overall** WER of a text. The reasoning here follows the one of CER granularity.
In OCR-D we distinguish between the WER per **page** and the **overall** WER of a document. The reasoning here follows the one of CER granularity.

Currently we only provide WER per page; higher-level WER results might be calculated at a later stage.

#### Bag of Words

In the "Bag of Words" model a text is represented as a set of its word irregardless of word order or grammar; Only the words themselves and their number of occurence are considered.
In the "Bag of Words" (BaW) model, a text is represented as a multiset of the words (as defined in the previous section) it contains, regardless of their order.

Example:

@@ -218,7 +221,7 @@ CPU time is the time taken by the CPU(s) on the processors. It does not include

#### Wall Time

Wall time (or elapsed time) is the time taken by a processor to process an instruction including idle time.
Wall-clock time (or elapsed time) is the time taken on the processors including idle time but ignoring concurrency.
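
To illustrate the difference between the two measures, a minimal Python sketch that records both for a single workflow step (`step` is just a placeholder callable, not an actual OCR-D API):

```python
import time

def timed_run(step):
    """Run `step()` and report wall-clock time (includes waiting/idle time)
    and CPU time of the current process (excludes idle time; CPU time spent
    in child processes would need e.g. resource.getrusage to be captured)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    step()
    return {
        "wall_time_s": time.perf_counter() - wall_start,
        "cpu_time_s": time.process_time() - cpu_start,
    }
```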

#### I/O

@@ -254,13 +257,13 @@ The following metrics are not part of the MVP (minimal viable product) and will

#### GPU Metrics

##### GPU Time
##### GPU Avg Usage

GPU time is the time a GPU (graphics card) spent processing instructions
GPU avg usage is the average GPU load during the execution of a workflow represented by a real number between 0 and 1.

##### GPU Avg Memory
##### GPU Peak Usage

GPU avg memory refers to the average amount of memory of the GPU (in GiB) that was used during processing.
GPU peak usage is the maximum GPU load during the execution of a workflow represented by a real number between 0 and 1.
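
As a sketch of how these two values could be sampled during a workflow run (assuming an NVIDIA GPU and the `pynvml` bindings; this is an illustration, not the tooling prescribed by OCR-D):

```python
import time
import pynvml  # assumption: NVIDIA GPU with the pynvml package installed

def sample_gpu_usage(workflow_is_running, interval=1.0, device_index=0):
    """Poll GPU utilization while the workflow runs and report the average
    and peak load as real numbers between 0 and 1."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    while workflow_is_running():
        load = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
        samples.append(load / 100.0)
        time.sleep(interval)
    pynvml.nvmlShutdown()
    return {
        "gpu_avg_usage": sum(samples) / len(samples) if samples else 0.0,
        "gpu_peak_usage": max(samples, default=0.0),
    }
```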

#### Text Evaluation

@@ -281,17 +284,20 @@ and measuring average CER of all pairs.

The algorithm can be summarized as follows:

> 1. Split the two input texts into text lines
> 2. Sort the ground truth text lines by length (in descending order)
> 3. For the first ground truth line, find the best matching OCR result line segment (by minimising a penalty that is partly based on string edit distance)
> 4. If full match (full length of line)
> a. Mark as done and remove line from list
> b. Else subdivide and add to respective list of text lines; resort
> 1. Split both input texts into text lines
> 2. Sort the GT lines by length
> (in descending order)
> 3. For the top GT line, find the best fully or partially matching OCR line
> (by lowest edit distance and highest coverage)
> 4. If full match (i.e. full length of line)
> a. Mark as done and remove line from both lists
> b. Else mark matching part as done,
> then cut off unmatched part and add to respective list of text lines; resort
> 5. If any more lines available repeat step 3
> 6. Count non-matched lines / strings as insertions or deletions (depending on origin: ground truth or result)
> 7. Sum up all partial edit distances and calculate overall character accuracy
> 6. Count remaining unmatched lines as insertions or deletions (depending on origin – GT or OCR)
> 7. Calculate the (micro-)average CER of all marked pairs and return as overall FCER

(C. Clausner, S. Pletschacher and A. Antonacopoulos / Pattern Recognition Letters 131 (2020) 390–397, p. 392)
(paraphrase of C. Clausner, S. Pletschacher and A. Antonacopoulos / Pattern Recognition Letters 131 (2020) 390–397, p. 392)
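
A greatly simplified, greedy sketch of that matching idea (it omits the penalty function, partial matches and line subdivision of the published algorithm and reuses the `levenshtein_ops` helper from the CER sketch, so it only conveys the overall structure):

```python
def flexible_cer(gt_lines, ocr_lines):
    """Pair each GT line (longest first) with the OCR line of lowest edit
    distance, micro-average the character errors over all pairs, and count
    leftover lines as pure deletions/insertions."""
    ocr_pool = list(ocr_lines)
    errors = total = 0
    for gt_line in sorted(gt_lines, key=len, reverse=True):
        if not ocr_pool:
            errors += len(gt_line)   # unmatched GT line: all characters deleted
            total += len(gt_line)
            continue
        best = min(ocr_pool, key=lambda line: sum(levenshtein_ops(gt_line, line)[:3]))
        i, s, d, c = levenshtein_ops(gt_line, best)
        errors += i + s + d
        total += i + s + d + c
        ocr_pool.remove(best)
    for line in ocr_pool:            # unmatched OCR lines: all characters inserted
        errors += len(line)
        total += len(line)
    return errors / total if total else 0.0
```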

#### Layout Evaluation

@@ -317,22 +323,50 @@ In the context of object detection in images, it measures either
* the ratio of correctly segmented pixels over the image size
(assuming all predictions can be combined into some coherent segmentation).

**Recall**, on the other hand, measures how well a model performs in finding all instances of an object in an image (true positives), irregardless of false positives. Given a model tries to identify bicycles in an image, a recall of 1 indicates that all bicycles have been found by the model (while not considering other objects that have been falsely labelled as a bicycle).
**Recall**, on the other hand, describes to what degree a model predicts what is actually present.
The higher the recall of a model, the more confidently we can assume that it covers everything to be found
(e.g. the model having identified every bicycle, car, person etc. in an image).
A recall of 1 (or 100%) indicates that every object received a correct prediction (true positives) and that no objects were missed or mislabelled (false negatives). The lower the recall value, the more false negatives.

In the context of object detection in images, it measures either
- the ratio of correctly detected segments over all actual segments, or
- the ratio of correctly segmented pixels over the image size.

Notice that the two goals naturally conflict with each other. A good predictor needs both high precision and high recall.
But the optimal trade-off depends on the application.
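
Expressed in terms of true positive, false positive and false negative counts, a minimal sketch:

```python
def precision(tp, fp):
    """Share of predictions that are correct (detected segments that really
    correspond to a GT segment)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of GT objects that received a correct prediction."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```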

For layout analysis though, the underlying notion of sufficient overlap itself is inadequate:
- it does not discern oversegmentation from undersegmentation
- it does not discern splits/merges that are allowable (irrelevant w.r.t. text flow) from those that are not (breaking up or conflating lines)
- it does not discern foreground from background, or when partial overlap starts breaking character legibility or introducing ghost characters

###### Prediction Score

When a model tries to identify objects in an image, it predicts that a certain area in an image represents said object with a certain confidence or prediction score. The prediction score varies between 0 and 1 and represents the percentage of certainty of having correctly identified an object. Given a model tries to identify ornaments on a page. If the model returns an area of a page with a prediction score of 0.6, the model is "60% sure" that this area is an ornament. If this area is then considered to be a positive, depends on the chosen threshold.
Most types of model can output a confidence score alongside each predicted object,
which represents the model's certainty that the prediction is correct.
For example, when a model tries to identify ornaments on a page, if it returns a segment (polygon / mask)
with a prediction score of 0.6, the model asserts there is a 60% probability that there is an ornament at that location.
Whether this prediction is then considered a positive detection depends on the chosen threshold.

###### IoU Thresholds

###### Thresholds
For object detection, the metrics precision and recall are usually defined in terms of a threshold for the degree of overlap
(represented by the IoU as defined [above](#iou-intersection-over-union), ranging between 0 and 1)
above which pairs of detected and GT segments are qualified as matches.

A threshold is a freely chosen number between 0 and 1. It divides the output of a model into two groups: Outputs that have a prediction score or IoU greater than or equal to the threshold represent an object. Outputs with a prediction score or IoU below the threshold are discarded as not representing the object.
(Predictions that match no GT object – false positives – and GT objects that match no prediction – false negatives – contribute only indirectly, via the respective denominators.)

Example:
Given a threshold of 0.6 and a model that tries to detect bicycles in an image. The model returns two areas in an image that might be bicycles, one with a prediction score of 0.4 and one with 0.9. Since the threshold equals 0.6, the first area is tossed and not regarded as bicycle while the second one is kept and counted as recognized.
Given a prediction threshold of 0.8, an IoU threshold of 0.6 and a model that tries to detect bicycles in an image which depicts two bicycles.
The model returns two areas in the image that might be bicycles, one with a confidence score of 0.4 and one with 0.9. Since the prediction threshold equals 0.8, the first candidate is immediately discarded. The other
is compared to both bicycles in the GT. One GT object is missed entirely (false negative); the other intersects the remaining prediction, but the prediction is twice as large.
Therefore, the union of that pair is more than double the intersection, so the IoU falls below 0.5. Since the IoU threshold equals 0.6, even the second candidate is not regarded as a match: it counts as a false positive, and the GT object it overlaps as a false negative. Overall, both precision and recall are zero (because the 1 kept prediction is a false positive and both GT objects are false negatives).
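
A sketch of this evaluation logic for axis-aligned bounding boxes (real layout evaluation operates on polygons or masks; all helper names here are made up for illustration):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def evaluate_detections(predictions, gt_boxes, score_thr=0.8, iou_thr=0.6):
    """Greedily match scored predictions [(box, score), ...] against GT boxes
    and return precision and recall under the given thresholds."""
    kept = [box for box, score in predictions if score >= score_thr]
    unmatched_gt = list(gt_boxes)
    tp = 0
    for box in kept:
        best = max(unmatched_gt, key=lambda g: box_iou(box, g), default=None)
        if best is not None and box_iou(box, best) >= iou_thr:
            tp += 1
            unmatched_gt.remove(best)   # each GT object can only be matched once
    prec = tp / len(kept) if kept else 0.0
    rec = tp / len(gt_boxes) if gt_boxes else 0.0
    return prec, rec
```

Applied to the bicycle example above (one prediction filtered out by the score threshold, the other failing the IoU threshold), this returns a precision and recall of 0.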

###### Precision-Recall Curve

Precision and recall are connected to each other since both depend on the true positives detected. A precision-recall-curve is a means to balance these values while maximizing them.
By varying the prediction threshold (and/or the IoU threshold), the tradeoff between precision and recall can be tuned.
When the full range of combinations has been gauged, the result can be visualised in a precision-recall curve (a related plot is the receiver operating characteristic, ROC).
Usually the optimum balance is where the product of precision and recall is maximal; the area under the curve serves as a summary measure across all thresholds.
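
A sketch of how such a curve could be collected by sweeping the prediction threshold, reusing the illustrative `evaluate_detections` helper from above (the IoU threshold is kept fixed here):

```python
def precision_recall_curve(predictions, gt_boxes, iou_thr=0.6, steps=9):
    """Evaluate the detector at a range of prediction thresholds and return
    the list of (threshold, precision, recall) points of the curve."""
    curve = []
    for k in range(1, steps + 1):
        thr = k / (steps + 1)            # e.g. 0.1, 0.2, ..., 0.9
        prec, rec = evaluate_detections(predictions, gt_boxes,
                                        score_thr=thr, iou_thr=iou_thr)
        curve.append((thr, prec, rec))
    return curve

# operating point with the highest product of precision and recall:
# best = max(curve, key=lambda point: point[1] * point[2])
```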

Given a dataset with 100 images in total of which 50 depict a bicycle. Also given a model trying to identify bicycles on images. The model is run 7 times using the given dataset while gradually increasing the threshold from 0.1 to 0.7.

@@ -378,6 +412,8 @@ AP & = \displaystyle\sum_{k=0}^{k=n-1}[r(k) - r(k+1)] * p(k) \\
\end{array}
$$

Usually, AP calculation also involves _smoothing_ (i.e. clipping local minima) and _interpolation_ (i.e. adding data points between the measured confidence thresholds).
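
A sketch of that summation, including the precision-envelope smoothing (the points are assumed to be ordered by descending recall, as in the formula above, and recall is taken as 0 beyond the last point; exact interpolation schemes differ between benchmarks):

```python
def average_precision(points):
    """AP = sum over k of [r(k) - r(k+1)] * p(k) for (recall, precision)
    `points` ordered by descending recall."""
    # smoothing: at each point use the best precision reached at any
    # higher-or-equal recall, which clips local minima of the curve
    smoothed, best = [], 0.0
    for recall, precision in points:
        best = max(best, precision)
        smoothed.append((recall, best))
    ap = 0.0
    for k, (r_k, p_k) in enumerate(smoothed):
        r_next = smoothed[k + 1][0] if k + 1 < len(smoothed) else 0.0
        ap += (r_k - r_next) * p_k
    return ap
```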

###### Mean Average Precision

Mean Average Precision (mAP) is a metric used to measure the full potential of an object detector over various conditions.