-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove deleted text regions from reading order, validate coords #15
Conversation
PAGE XML requires all elements referenced in the reading order to be actually present as a region. With this commit, deleted zones are removed from the reading order element which itself is reindexed afterwards. Fixes OCR-D#3
- ReadingOrder can be more than just an OrderedGroup list of RegionRefIndexed: - OrderedGroup list of OrderedGroupIndexed - OrderedGroup list of UnoderedGroupIndexed - UnorderedGroup list of RegionRef - UnorderedGroup list of OrderedGroup - UnorderedGroup list of UnoderedGroup - for any *GroupIndexed: same options as the Group of the same name (recursive) - so we need to query more members, and ask for their class in order to query for the right members in turn; also, we have to write back to the respective list with the masked regions removed; and (only) for ordered groups, we must re-index that list; finally, in the group case, we have to call recursively (which moreover necessitates separating this into a function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff!
LOG.info('Comparing regions "%s" and "%s"', regions[i].id, regions[j].id) | ||
region_poly1 = Polygon(polygon_from_points(regions[i].get_Coords().points)) | ||
region_poly2 = Polygon(polygon_from_points(regions[j].get_Coords().points)) | ||
mark_for_deletion = list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why you move from set
to list
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the comment on that in the last commit: Because generateDS does not make element classes hashable.
if not isinstance(elem, (RegionRefType, RegionRefIndexedType)): | ||
_plausibilize_group(regions, elem, mark_for_deletion) | ||
for region in regions: | ||
if (region in mark_for_deletion and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in O(1)
with set
but in O(n)
with list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. It's a shame all our processors must be formulated so clumsily just because generateDS output is suboptimal. It will be hard to change that even after we start injecting custom code.
Supplant #4, fix #3.
@kba, The new validation behaviour could also be interesting for
ocrd_validators
in core.