more geometry heuristics for validate/repair #5

bertsky · 2019-06-25T11:45:00Z

We should have heuristics to check for

- polygon containment (overlapping regions, word outside line etc.)
- artifacts from annotation like point or line-like regions
- lines with (way) too much whitespace (bad cropping, or bad segmentation)
- probably even: missing @orientation

Originally posted by @kba in OCR-D/assets#28 (comment)


bertsky · 2019-06-26T08:47:01Z

BTW, shapely.geometry.polygon.Polygon has a very nice API for the first 2 tasks, including the contains() method and the area property.
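
A minimal sketch of how these two checks could look with shapely. The function name `check_child`, the `min_area` threshold and the problem labels are hypothetical and chosen only for illustration; this is not the validator's actual code.

```python
from shapely.geometry import Polygon

def check_child(parent_coords, child_coords, min_area=1.0):
    """Return a list of problems found for a child outline (hypothetical helper)."""
    parent = Polygon(parent_coords)
    child = Polygon(child_coords)
    # `area` is a property in shapely; a point- or line-like outline
    # has (almost) no area. `min_area` is an assumed threshold in px^2.
    if child.area < min_area:
        return ["degenerate"]
    problems = []
    # contains() is True only if no part of the child lies outside the parent
    if not parent.contains(child):
        problems.append("not contained")
    return problems

region = [(0, 0), (100, 0), (100, 50), (0, 50)]
print(check_child(region, [(10, 10), (120, 10), (120, 40), (10, 40)]))
```

Degenerate outlines are reported first and not tested for containment, because predicates on invalid or zero-area geometries are not reliable.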

The third could be achieved with ad-hoc binarization and some simple Numpy statistics like count_nonzero() (i.e. pixel-counting), or nonzero() followed by amin() and amax() to get non-white bounds (i.e. area-counting).
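
A rough sketch of both variants, assuming a grayscale line image with dark ink on a light background and a fixed global threshold as the ad-hoc binarization. The name `whitespace_stats` and the threshold value are assumptions for illustration.

```python
import numpy as np

def whitespace_stats(gray, threshold=128):
    """gray: 2-D uint8 array. Returns (ink_ratio, box_ratio), both in [0, 1]."""
    ink = gray < threshold                      # True where foreground
    # pixel-counting: share of foreground pixels in the whole crop
    ink_ratio = np.count_nonzero(ink) / ink.size
    if ink_ratio == 0:
        return 0.0, 0.0                         # completely empty crop
    # area-counting: share of the crop covered by the non-white bounds
    ys, xs = np.nonzero(ink)
    box_h = int(np.amax(ys) - np.amin(ys) + 1)
    box_w = int(np.amax(xs) - np.amin(xs) + 1)
    box_ratio = (box_h * box_w) / ink.size
    return float(ink_ratio), float(box_ratio)
```

A low `box_ratio` hints at bad cropping (large margins around the ink), while a low `ink_ratio` combined with a high `box_ratio` hints at scattered foreground, as with a line that spans unrelated fragments. Suitable cut-off values would have to be determined empirically.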

And orientation checking could be done in a similar way to deskewing (i.e. entropy-based), but with some kind of confidence measure.
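
As a heavily simplified illustration of the entropy idea: the sketch below only decides between horizontal and vertical text lines by comparing the entropy of the row and column projection profiles, and uses the difference as a crude confidence. It does not estimate a skew angle and cannot tell upside-down text apart. The function names and the confidence definition are assumptions, not an existing API.

```python
import numpy as np

def normalized_entropy(profile):
    """Shannon entropy of a projection profile, scaled to [0, 1]."""
    profile = np.asarray(profile, dtype=float)
    total = profile.sum()
    if total == 0 or profile.size < 2:
        return 0.0
    p = profile[profile > 0] / total
    return float(-(p * np.log2(p)).sum() / np.log2(profile.size))

def guess_text_axis(ink):
    """ink: 2-D boolean array, True where foreground.
    Returns (axis, confidence); confidence 0.0 means undecidable."""
    rows = normalized_entropy(ink.sum(axis=1))  # one value per image row
    cols = normalized_entropy(ink.sum(axis=0))  # one value per image column
    # Horizontal text lines leave empty rows between them, so the row
    # profile is peaky (low entropy) and the column profile is flat.
    axis = "horizontal" if rows <= cols else "vertical"
    return axis, abs(rows - cols)
```

A region whose annotated orientation disagrees with a high-confidence guess, or which has no orientation at all, could then be flagged for review.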

bertsky · 2019-07-18T12:59:26Z

A good reference for additional checks is the list of validation error classes in Aletheia, p. 118/119.

kba · 2019-08-07T10:49:31Z

c.f. https://github.com/OCR-D/ocrd_evaluate_segmentation

bertsky · 2019-08-14T07:08:41Z

> c.f. https://github.com/OCR-D/ocrd_evaluate_segmentation

now renamed to https://github.com/OCR-D/ocrd_segment (there will be more processors)

kba · 2019-08-15T09:47:21Z

https://github.com/OCR-D/ocrd_segment is a better place for this.

bertsky · 2019-08-15T21:39:41Z

Moved the original issue from core here to have a better reminder of what is left to do.

Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)

bertsky · 2019-08-16T16:48:28Z

> (We do not yet check whether elements are properly contained within their parents' outline.)

And the question then is: what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

bertsky · 2019-11-12T08:44:20Z

> Out of the original list, we are still somewhere in the first item I think. (We do not yet check whether elements are properly contained within their parents' outline.)
>
> And the question then is: what does repair look like in that case? Shrink the element's polygon or extend the parent's polygon?

With #15 we now have covered the first item, except for repair. So far, we can only repair:

- overlapping regions (with plausibilize=True) when near-equal or properly contained (but not near-contained or partial overlap)
- lines extending from regions (with sanitize=True) by overwriting the region polygon with a hull of the lines (but not the other way, and not on the other levels)
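
The second kind of repair can be sketched with shapely as follows. This is an illustration only, not the code behind `sanitize=True`; in particular, using the convex hull is an assumption (the comment only says "a hull"), and the function name `repair_region` is made up.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def repair_region(region_coords, line_coords_list):
    """Return the region polygon, replaced by the hull of its lines
    if at least one line extends beyond the region outline."""
    region = Polygon(region_coords)
    lines = [Polygon(coords) for coords in line_coords_list]
    if all(region.covers(line) for line in lines):
        return region                      # nothing to repair
    # assumption: convex hull of the union of all line polygons
    return unary_union(lines).convex_hull

region = [(0, 0), (100, 0), (100, 50), (0, 50)]
lines = [
    [(10, 5), (90, 5), (90, 20), (10, 20)],
    [(10, 25), (110, 25), (110, 40), (10, 40)],   # sticks out to the right
]
repaired = repair_region(region, lines)
```

Note that overwriting the region with the hull can also shrink it where the original outline was more generous than the lines.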

wrznr · 2019-11-12T08:55:13Z

Partial Overlap of region a and b

1. Merge `a` and `b` if of same type
2. Shrink `b` to non-overlapping part (i.e. difference) if `a` is of type text
3. Vice versa `b`
4. Else?
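
These rules map directly onto shapely's set operations. The sketch below is one possible reading of the list; the type labels `"text"` and `"image"` and the function name `resolve_overlap` are hypothetical, the open "Else?" case simply leaves both regions unchanged, and no check is made whether a merge is actually sensible.

```python
from shapely.geometry import box

def resolve_overlap(a, type_a, b, type_b):
    """Return the list of polygons that replace regions a and b."""
    if a.intersection(b).area == 0:
        return [a, b]                       # no real overlap
    if type_a == type_b:
        return [a.union(b)]                 # rule 1: merge
    if type_a == "text":
        return [a, b.difference(a)]         # rule 2: shrink b
    if type_b == "text":
        return [a.difference(b), b]         # rule 3: shrink a
    return [a, b]                           # rule 4: undecided, keep both

a = box(0, 0, 10, 10)
b = box(5, 0, 15, 10)
kept, shrunk = resolve_overlap(a, "text", b, "image")
```

Depending on the shapes involved, `difference()` can return a MultiPolygon, which would have to be split into separate regions or reduced to its largest part.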

bertsky · 2019-11-12T09:56:26Z

> Partial Overlap of region a and b
> 1. Merge `a` and `b` if of same type

Yes, but for text regions we would need to bring in the concept of Allowable Merge (w.r.t. ReadingOrder and @readingDirection|@readingOrientation) first:

A merge is allowed iff a and b are direct successors in the reading order, they have equal reading direction, and the axis of that direction (i.e. horizontal vs. vertical) is orthogonal to the axis along which the two bounding boxes deviate most.
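
Written out as a predicate, this could look like the sketch below. The `Region` class and all function names are hypothetical, and "deviate most" is interpreted here as the larger distance between bounding box centres, which is only one possible reading.

```python
from dataclasses import dataclass

@dataclass
class Region:
    order: int        # position in the reading order
    direction: str    # @readingDirection, e.g. "left-to-right"
    bbox: tuple       # (minx, miny, maxx, maxy)

HORIZONTAL = {"left-to-right", "right-to-left"}

def reading_axis(direction):
    return "x" if direction in HORIZONTAL else "y"

def deviation_axis(a, b):
    # assumption: compare the offsets of the bounding box centres
    dx = abs((a.bbox[0] + a.bbox[2]) - (b.bbox[0] + b.bbox[2])) / 2
    dy = abs((a.bbox[1] + a.bbox[3]) - (b.bbox[1] + b.bbox[3])) / 2
    return "x" if dx > dy else "y"

def merge_allowed(a, b):
    return (abs(a.order - b.order) == 1                 # direct successors
            and a.direction == b.direction              # equal direction
            and reading_axis(a.direction) != deviation_axis(a, b))

upper = Region(0, "left-to-right", (0, 0, 100, 50))
lower = Region(1, "left-to-right", (0, 40, 100, 90))
print(merge_allowed(upper, lower))
```

With horizontal reading direction, two vertically stacked paragraphs may be merged, while two side-by-side columns may not.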

And if a merge is not allowed between two overlapping text regions, then the intersecting foreground should somehow fall into that region which it is most consistent with (i.e. regarding its alignment and center of mass).

> Shrink b to non-overlapping part (i.e. difference) if a is of type text
> Vice versa b
> Else?

If a and b are of different, both non-text type, I'd say it does not matter.

BTW, do we want to go into the complexities of using PAGE-XML's Layers? (Then we could avoid changing the coordinates altogether, and would merely have to decide on @zIndex ordering...)

wrznr · 2019-11-12T10:19:03Z

> Layers

I fear this implies drastic changes to core. Let's better not do that for now.

> ReadingOrder

We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary! Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

bertsky · 2019-11-12T10:35:07Z

> Layers
>
> I fear this implies drastic changes to core. Let's better not do that for now.

Agreed. (The way this is formalised in PAGE-XML, it would still be impossible to separate/suppress foreground automatically.)

> ReadingOrder
>
> We have to distinguish here: Right now, we do not have any RO computation. It is more or less arbitrary!

I disagree. Even if we don't know the reading order, that's a separate problem. No RO equals default RO (i.e. XML element order), right? Whatever the RO in the document, the repair decision always depends on it.

> Maybe if the DFKI guys deliver, this will change. I think we should sanitize and fix the RO ad hoc.

Fixing RO is another problem/step. And especially when we have overlapping regions, this becomes circular if all we can do is heuristics.

IMHO a good RO detection would have to be data-driven, and informed by the precise @type (and possible @custom sub-type) of the regions.

wrznr · 2019-11-12T11:08:38Z

> No RO equals default RO

Actually, I think that indeed RO = default RO. But you're right, we should not base hacks on hacks.

bertsky · 2019-11-12T12:34:05Z

> No RO equals default RO
>
> But you're right, we should not base hacks on hacks.

Well, or maybe just a little: Let's say we have a region segmentation like Tesseract that can output reading direction within regions (via orientation analysis), but is really bad on reading order between regions – creating XML elements more or less in random order. (The same could happen with a NN module without RO.)

Strictly speaking, when repairing we would then be unable to merge or split most of the time (because 2 neighbouring/overlapping regions are XML successors only by chance). But we could still repair the unambiguous cases if we first added a new RO based on a top-down-left-to-right assumption (treating overlapping regions as neighbours), ... I think. At least as an extra option for the desperate.
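
The simplest form of such a fallback is a sort by the top-left corner of each region's bounding box. The sketch below assumes regions are given as a mapping from a (made-up) region ID to `(minx, miny, maxx, maxy)`; the function name is hypothetical as well.

```python
def fallback_reading_order(regions):
    """Return region IDs sorted top-down, then left-to-right.

    regions: dict mapping region ID to (minx, miny, maxx, maxy).
    """
    return sorted(regions, key=lambda rid: (regions[rid][1], regions[rid][0]))

regions = {
    "r3": (0, 200, 100, 300),
    "r1": (0, 0, 100, 100),
    "r2": (120, 0, 220, 100),
}
order = fallback_reading_order(regions)
```

This naive ordering interleaves the regions of multi-column layouts and is sensitive to small vertical offsets, so it fits the "extra option" role suggested above rather than a default.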

EEngl52 assigned kba Aug 7, 2019

kba closed this as completed Aug 15, 2019

bertsky transferred this issue from OCR-D/core Aug 15, 2019

bertsky changed the title ~~extend PAGE validator with geometry heuristics~~ more geometry heuristics for validate/repair Aug 15, 2019

bertsky assigned wrznr and bertsky Aug 15, 2019

bertsky added the enhancement New feature or request label Aug 15, 2019

bertsky reopened this Aug 15, 2019

bertsky mentioned this issue Nov 26, 2019

plausibilize and sanitize are too broad terms #18

Open
