Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deleted text regions from reading order, validate coords #15

Merged
merged 8 commits into from
Nov 12, 2019

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Nov 11, 2019

Supplant #4, fix #3.

@kba, The new validation behaviour could also be interesting for ocrd_validators in core.

wrznr and others added 8 commits August 15, 2019 13:43
PAGE XML requires all elements referenced in the reading order
to be actually present as a region. With this commit, deleted
zones are removed from the reading order element which itself
is reindexed afterwards.

Fixes OCR-D#3
- ReadingOrder can be more than just an OrderedGroup list of RegionRefIndexed:
  - OrderedGroup list of OrderedGroupIndexed
  - OrderedGroup list of UnoderedGroupIndexed
  - UnorderedGroup list of RegionRef
  - UnorderedGroup list of OrderedGroup
  - UnorderedGroup list of UnoderedGroup
  - for any *GroupIndexed: same options as
    the Group of the same name (recursive)
- so we need to query more members, and ask for their class
  in order to query for the right members in turn; also,
  we have to write back to the respective list with the
  masked regions removed; and (only) for ordered groups,
  we must re-index that list; finally, in the group case,
  we have to call recursively (which moreover necessitates
  separating this into a function)
@bertsky bertsky requested a review from wrznr November 11, 2019 22:37
@bertsky bertsky added bug Something isn't working enhancement New feature or request labels Nov 11, 2019
Copy link
Collaborator

@wrznr wrznr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!

LOG.info('Comparing regions "%s" and "%s"', regions[i].id, regions[j].id)
region_poly1 = Polygon(polygon_from_points(regions[i].get_Coords().points))
region_poly2 = Polygon(polygon_from_points(regions[j].get_Coords().points))
mark_for_deletion = list()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why you move from set to list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comment on that in the last commit: Because generateDS does not make element classes hashable.

if not isinstance(elem, (RegionRefType, RegionRefIndexedType)):
_plausibilize_group(regions, elem, mark_for_deletion)
for region in regions:
if (region in mark_for_deletion and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in O(1) with set but in O(n) with list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. It's a shame all our processors must be formulated so clumsily just because generateDS output is suboptimal. It will be hard to change that even after we start injecting custom code.

@bertsky bertsky merged commit faaf83f into OCR-D:master Nov 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adjust reading order when deleting regions
2 participants