PageType.get_AllRegions to list all kinds of regions #479

Merged
merged 36 commits into from
Jun 4, 2020
Changes from 34 commits
Commits
abef352
PageType.get_AllRegions to list all kinds of regions
kba May 13, 2020
3445f87
Update ocrd_models/ocrd_page_user_methods.py
bertsky May 14, 2020
a48b8c1
update generateds page, add region filter if using reading order, wip
kba May 14, 2020
f51a2e4
Merge branch 'hotfix-ocrd-page-exports' into get-all-regions
kba May 14, 2020
8da3f3c
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 14, 2020
d2a01bb
refactoring: move generateDS methods to their own files
kba May 15, 2020
be7f026
get_AllRegions: adapt to signature proposed in #240, test with order=…
kba May 15, 2020
e1740f7
README: explain how to add user methods to PAGE API
kba May 15, 2020
6f9163e
Update ocrd_models/README.md
kba May 28, 2020
0c73b3e
Update ocrd_models/README.md
kba May 28, 2020
5c2f3a8
Update ocrd_models/README.md
kba May 28, 2020
6a57506
recursion (with both finite or arbitrary depth) for get_AllRegions
kba May 28, 2020
a9072c8
regenerate PAGE API
kba May 28, 2020
ac62b85
get_AllRegions: clean-up merge artifacts and reorganize
kba May 28, 2020
fd6d545
Update ocrd_models/ocrd_page_user_methods/get_AllRegions.py
kba May 28, 2020
86a7133
get_AllRegions: _region_id method unneccessary now
kba May 28, 2020
ce06392
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 28, 2020
5c8d89b
regenerate PAGE API
kba May 28, 2020
f6e3da5
:art: pylint
kba May 28, 2020
8351056
add_AllIndexed -> extend_AllIndexed
kba May 28, 2020
f202205
get_AllRegions: differentiate "reading-order"/"reading-order-only"
kba May 28, 2020
ffba6f9
get_AllRegions: catch negative depth, test depth==0
kba May 29, 2020
207f396
:memo: get_AllRegions: document example
bertsky May 29, 2020
9ced315
get_AllRegions: fix recursion
kba May 29, 2020
629f38d
get_AllRegions: Update example
kba May 29, 2020
e958559
wip
kba May 29, 2020
1964563
reading order test sample: add unorderedgroups for testing
kba May 29, 2020
27e256f
add get_UnorderedGroupChildren, let get_AllIndexed handle UnorderedGr…
kba May 29, 2020
1b17e3f
get_AllIndexed: allow filtering by child type
kba May 29, 2020
ae613cf
get_AllIndexed: index_sort parameter to enable/disable sorting
kba May 29, 2020
b1df95f
add sort_AllIndexed to sort in-place
kba May 29, 2020
fd9dc83
extend_AllIndexed: increment @index when adding elements
kba May 29, 2020
9d0e539
Merge branch 'master' into get-all-regions
kba May 29, 2020
84f1d33
:memo: changelog
kba May 29, 2020
0e14633
Document extend_AllIndexed validate_contiunuity param
kba Jun 3, 2020
b79474a
Merge branch 'master' into get-all-regions
kba Jun 4, 2020
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,22 @@ Versioned according to [Semantic Versioning](http://semver.org/).

## Unreleased

Added:

* OcrdPage: `get_AllRegions`: retrieve all regions, sorted by document or reading order, #479
* OcrdPage: `sort_AllIndexed`: sort all children by `@index` in-place
* OcrdPage: `clear_AllIndexed`: clear all `@index` children
* OcrdPage: `extend_AllIndexed`: Add elements with incrementing `@index`
* OcrdPage: Replace empty reading order groups with equivalent `RegionRef` on export
* OcrdPage: `get_UnorderedGroupChildren`: get reading order elements of an `UnorderedGroup`


Changed:

* OcrdPage: `get_AllIndexed`: allow filtering by child type
* OcrdPage: `get_AllIndexed`: `index_sort` parameter to enable/disable sorting


## [2.7.1] - 2020-05-27

Fixed:
31 changes: 31 additions & 0 deletions ocrd_models/README.md
@@ -3,3 +3,34 @@
> OCR-D framework - file format APIs and schemas

See https://github.com/OCR-D/core

## Adding user methods to the generated PAGE API

Let's say you want to add a method `get_FirstTextRegion` on the pc:Page element:

1. Create a file `ocrd_models/ocrd_page_user_methods/get_FirstTextRegion.py`

```python
def get_FirstTextRegion(self):
    return self.get_TextRegion()[0]
```

**NOTE** The method name and file name must be identical.

**NOTE** Do not use Python's `%` string interpolation operator; it will break generateDS. Use `"".format(...)` instead.

2. Edit `ocrd_models/ocrd_page_user_methods.py` and append to the `METHOD_SPECS` list:

```python
METHOD_SPECS = (
    # ...
    _add_method(r'^PageType$', 'get_FirstTextRegion'),
    # ...
)
```

3. Regenerate the PAGE API:

```sh
make generate-page
```
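
What steps 1 and 2 do behind the scenes can be sketched without generateDS: the method file is read, each line is indented by four spaces so that it fits into a class body, and a regular expression selects the classes that receive the method. The helper names `load_method_source` and `applies_to` are hypothetical and exist only for this illustration; they are not part of OCR-D or generateDS.

```python
import re
import tempfile
from pathlib import Path


def load_method_source(directory, method_name):
    """Read <directory>/<method_name>.py and indent non-empty lines by four spaces."""
    path = Path(directory) / "{}.py".format(method_name)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join("    " + line if line.strip() else line for line in lines)


def applies_to(class_re, class_name):
    """Return True if the class-name pattern matches the given class name."""
    return re.search(class_re, class_name) is not None


# Write a throwaway method file and load it the way a user method would be loaded.
with tempfile.TemporaryDirectory() as tmpdir:
    Path(tmpdir, "get_FirstTextRegion.py").write_text(
        "def get_FirstTextRegion(self):\n    return self.get_TextRegion()[0]\n",
        encoding="utf-8")
    indented = load_method_source(tmpdir, "get_FirstTextRegion")

print(indented)
print(applies_to(r"^PageType$", "PageType"))
```

The anchored pattern `^PageType$` matters: without the anchors, the method would also be attached to any class whose name merely contains `PageType`.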
322 changes: 261 additions & 61 deletions ocrd_models/ocrd_models/ocrd_page_generateds.py

Large diffs are not rendered by default.

101 changes: 19 additions & 82 deletions ocrd_models/ocrd_page_user_methods.py
@@ -3,6 +3,8 @@
# source: https://bitbucket.org/dkuhlman/generateds/src/default/gends_user_methods.py

import re
import codecs
from os.path import dirname, join

#
# You must include the following class definition at the top of
@@ -80,94 +82,29 @@ def show(self):
# generated superclass file and also section "User Methods" in
# the documentation, as well as the examples below.

#
# Replace the following method specifications with your own.

#
# List all *Indexed children sorted by @index
#
get_AllIndexed = MethodSpec(name='get_AllIndexed',
    source=r'''
    def get_AllIndexed(self):
        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
    ''', class_names=r'^(OrderedGroupType|OrderedGroupIndexedType)$')

#
# Clear all *Indexed children sorted by @index
#
clear_AllIndexed = MethodSpec(name='clear_AllIndexed',
    source=r'''
    def clear_AllIndexed(self):
        ret = self.get_AllIndexed()
        self.set_RegionRefIndexed([])
        self.set_OrderedGroupIndexed([])
        self.set_UnorderedGroupIndexed([])
        return ret
    ''', class_names=r'^(OrderedGroupType|OrderedGroupIndexedType)$')

#
# Add all *Indexed children sorted by @index
#
add_AllIndexed = MethodSpec(name='add_AllIndexed',
    source=r'''
    def add_AllIndexed(self, elements):
        if not isinstance(elements, list):
            elements = [elements]
        for element in sorted(elements, key=lambda x : x.index):
            if isinstance(element, RegionRefIndexedType):
                self.add_RegionRefIndexed(element)
            elif isinstance(element, OrderedGroupIndexedType):
                self.add_OrderedGroupIndexed(element)
            elif isinstance(element, UnorderedGroupIndexedType):
                self.add_UnorderedGroupIndexed(element)
        return self.get_AllIndexed()
    ''', class_names=r'^(OrderedGroupType|OrderedGroupIndexedType)$')
def _add_method(class_re, method_name):
    """
    Loads a file ./ocrd_page_user_methods/{{ method_name }}.py and defines a MethodSpec applying to class_re
    """
    source = []
    with codecs.open(join(dirname(__file__), 'ocrd_page_user_methods', '%s.py' % method_name)) as f:
        for line in f.readlines():
            source.append('    %s' % line if line else line)
    return MethodSpec(name=method_name, class_names=class_re, source=''.join(source))


#
# export children sorted by index of the child element
#
exportChildren = MethodSpec(name='exportChildren',
    source=r'''
    def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True):
        eol_ = '\n' if pretty_print else ''
        namespaceprefix_ = 'pc:'
        if self.UserDefined is not None:
            self.UserDefined.export(outfile, level, namespaceprefix_, namespacedef_='', name_='UserDefined', pretty_print=pretty_print)
        for Labels_ in self.Labels:
            Labels_.export(outfile, level, namespaceprefix_, namespacedef_='', name_='Labels', pretty_print=pretty_print)
        cleaned = []
        # remove empty groups and replace with RegionRefIndexedType
        for entry in self.get_AllIndexed():
            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed():
                rri = RegionRefIndexedType.factory(parent_object_=self)
                rri.index = entry.index
                rri.regionRef = entry.regionRef
                cleaned.append(rri)
            else:
                cleaned.append(entry)
        for entry in cleaned:
            entry.export(outfile, level, namespaceprefix_, namespacedef_='', name_=entry.__class__.__name__[:-4], pretty_print=pretty_print)
    ''', class_names=r'^(OrderedGroupType|OrderedGroupIndexedType)$')
#
# Hash by memory address/id()
#
hash_by_id = MethodSpec(name='hash',
    source='''\
    def __hash__(self):
        return hash(self.id)
    ''',
    class_names=r'^.*$',
    )
#
# Provide a list of your method specifications.
# This list of specifications must be named METHOD_SPECS.
#
METHOD_SPECS = (
    hash_by_id,
    exportChildren,
    get_AllIndexed,
    add_AllIndexed,
    clear_AllIndexed,
    _add_method(r'^.*$', '__hash__'),
    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'get_AllIndexed'),
    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'clear_AllIndexed'),
    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'extend_AllIndexed'),
    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'sort_AllIndexed'),
    _add_method(r'^(OrderedGroupType|OrderedGroupIndexedType)$', 'exportChildren'),
    _add_method(r'^(UnorderedGroupType|UnorderedGroupIndexedType)$', 'get_UnorderedGroupChildren'),
    _add_method(r'^(PageType)$', 'get_AllRegions'),
)


2 changes: 2 additions & 0 deletions ocrd_models/ocrd_page_user_methods/__hash__.py
@@ -0,0 +1,2 @@
def __hash__(self):
    return hash(self.id)
7 changes: 7 additions & 0 deletions ocrd_models/ocrd_page_user_methods/clear_AllIndexed.py
@@ -0,0 +1,7 @@
def clear_AllIndexed(self):
    ret = self.get_AllIndexed()
    self.set_RegionRefIndexed([])
    self.set_OrderedGroupIndexed([])
    self.set_UnorderedGroupIndexed([])
    return ret

24 changes: 24 additions & 0 deletions ocrd_models/ocrd_page_user_methods/exportChildren.py
@@ -0,0 +1,24 @@
# pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True): # pylint: disable=unused-argument,too-many-arguments
    namespaceprefix_ = 'pc:'
    if self.UserDefined is not None:
        self.UserDefined.export(outfile, level, namespaceprefix_, namespacedef_='', name_='UserDefined', pretty_print=pretty_print)
    for Labels_ in self.Labels:
        Labels_.export(outfile, level, namespaceprefix_, namespacedef_='', name_='Labels', pretty_print=pretty_print)
    cleaned = []
    def replaceWithRRI(group):
        rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
        rri.index = group.index
        rri.regionRef = group.regionRef
        cleaned.append(rri)
    # remove empty groups and replace with RegionRefIndexedType
    for entry in self.get_AllIndexed():
        # pylint: disable=undefined-variable
        if isinstance(entry, (OrderedGroupIndexedType)) and not entry.get_AllIndexed():
            replaceWithRRI(entry)
        elif isinstance(entry, UnorderedGroupIndexedType) and not entry.get_UnorderedGroupChildren():
            replaceWithRRI(entry)
        else:
            cleaned.append(entry)
    for entry in cleaned:
        entry.export(outfile, level, namespaceprefix_, namespacedef_='', name_=entry.__class__.__name__[:-4], pretty_print=pretty_print)
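
The empty-group replacement in `exportChildren` can be illustrated on plain dictionaries. This is only a sketch of the idea, not the generated PAGE API: the dictionary layout and the function name `replace_empty_groups` are assumptions made for the example, and the sample data mirrors the two test groups added to `TEMP1_Gutachten2-2.xml`.

```python
GROUP_TYPES = ("OrderedGroupIndexed", "UnorderedGroupIndexed")


def replace_empty_groups(entries):
    """Swap every childless group for a RegionRefIndexed with the same index and regionRef."""
    cleaned = []
    for entry in entries:
        if entry["type"] in GROUP_TYPES and not entry["children"]:
            cleaned.append({
                "type": "RegionRefIndexed",
                "index": entry["index"],
                "regionRef": entry["regionRef"],
                "children": [],
            })
        else:
            cleaned.append(entry)
    return cleaned


entries = [
    {"type": "UnorderedGroupIndexed", "index": 20,
     "regionRef": "unordered-group-for-testing",
     "children": ["unordered-group-for-testing_region0001"]},
    {"type": "UnorderedGroupIndexed", "index": 21,
     "regionRef": "empty-group-for-testing", "children": []},
]
exported = replace_empty_groups(entries)
print([(e["index"], e["type"]) for e in exported])
```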
27 changes: 27 additions & 0 deletions ocrd_models/ocrd_page_user_methods/extend_AllIndexed.py
@@ -0,0 +1,27 @@
# pylint: disable=line-too-long,invalid-name,missing-module-docstring
def extend_AllIndexed(self, elements, validate_continuity=False):
    """
    Add all elements in list ``elements``, respecting ``@index`` order.
    """
    if not isinstance(elements, list):
        elements = [elements]
    siblings = self.get_AllIndexed()
    highest_sibling_index = siblings[-1].index if siblings else -1
    if validate_continuity:
        elements = sorted(elements, key=lambda x: x.index)
        lowest_element_index = elements[0].index
        if lowest_element_index <= highest_sibling_index:
            raise Exception("@index already used: {}".format(lowest_element_index))
    else:
        for element in elements:
            highest_sibling_index += 1
            element.index = highest_sibling_index
    for element in elements:
        if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
            self.add_RegionRefIndexed(element)
        elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
            self.add_OrderedGroupIndexed(element)
        elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
            self.add_UnorderedGroupIndexed(element)
    return self.get_AllIndexed()
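
The two indexing modes of `extend_AllIndexed` can be shown on bare integers. The function `extend_indices` below is a hypothetical reduction for illustration: it works on lists of `@index` values instead of PAGE elements, raises `ValueError` where the method above raises a plain `Exception`, and tolerates an empty input list, which the method above does not guard against.

```python
def extend_indices(sibling_indices, new_indices, validate_continuity=False):
    """Return the @index values the new elements would carry after being added."""
    highest = max(sibling_indices) if sibling_indices else -1
    if validate_continuity:
        # Keep the caller's indices, but refuse to reuse an occupied one.
        ordered = sorted(new_indices)
        if ordered and ordered[0] <= highest:
            raise ValueError("@index already used: {}".format(ordered[0]))
        return ordered
    # Default mode: ignore the caller's indices and continue counting.
    return [highest + offset for offset in range(1, len(new_indices) + 1)]


print(extend_indices([0, 1, 2], [7, 7]))
print(extend_indices([0, 1], [3, 2], validate_continuity=True))
```

In the default mode the incoming `@index` values are overwritten, so callers that care about their own numbering would presumably pass `validate_continuity=True`.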

17 changes: 17 additions & 0 deletions ocrd_models/ocrd_page_user_methods/get_AllIndexed.py
@@ -0,0 +1,17 @@
# pylint: disable=invalid-name,missing-module-docstring,line-too-long
def get_AllIndexed(self, classes=None, index_sort=True):
    """
    Get all indexed children sorted by their ``@index``.

    Arguments:
        classes (list): Type of children to return. Default: ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
        index_sort (boolean): Whether to sort by ``@index``
    """
    if not classes:
        classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
    ret = []
    for class_ in classes:
        ret += getattr(self, 'get_{}Indexed'.format(class_))()
    if index_sort:
        return sorted(ret, key=lambda x: x.index)
    return ret
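
To see the effect of `classes` and `index_sort` without a PAGE document, the method can be attached to a small stub. `Child` and `StubOrderedGroup` are hypothetical stand-ins for the generated `*IndexedType` and `OrderedGroupType` classes; only the body of `get_AllIndexed` is taken from the file above.

```python
from collections import namedtuple

# Hypothetical stand-in for RegionRefIndexedType and friends.
Child = namedtuple("Child", ["kind", "index"])


class StubOrderedGroup:
    """Exposes the three getters that get_AllIndexed looks up by name."""

    def __init__(self, region_refs=(), ordered_groups=(), unordered_groups=()):
        self._region_refs = list(region_refs)
        self._ordered_groups = list(ordered_groups)
        self._unordered_groups = list(unordered_groups)

    def get_RegionRefIndexed(self):
        return self._region_refs

    def get_OrderedGroupIndexed(self):
        return self._ordered_groups

    def get_UnorderedGroupIndexed(self):
        return self._unordered_groups

    def get_AllIndexed(self, classes=None, index_sort=True):
        if not classes:
            classes = ['RegionRef', 'OrderedGroup', 'UnorderedGroup']
        ret = []
        for class_ in classes:
            ret += getattr(self, 'get_{}Indexed'.format(class_))()
        if index_sort:
            return sorted(ret, key=lambda x: x.index)
        return ret


group = StubOrderedGroup(
    region_refs=[Child('RegionRef', 2), Child('RegionRef', 0)],
    ordered_groups=[Child('OrderedGroup', 1)],
)
all_sorted = [c.index for c in group.get_AllIndexed()]
refs_unsorted = [c.index for c in group.get_AllIndexed(classes=['RegionRef'], index_sort=False)]
print(all_sorted, refs_unsorted)
```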
86 changes: 86 additions & 0 deletions ocrd_models/ocrd_page_user_methods/get_AllRegions.py
@@ -0,0 +1,86 @@
# pylint: disable=line-too-long,invalid-name,protected-access,missing-module-docstring
def _region_class(self, x): # pylint: disable=unused-argument
    return x.__class__.__name__.replace('RegionType', '')

def _get_recursive_regions(self, regions, level, classes=None):
    if level == 1:
        # stop recursion, filter classes
        if classes:
            return [r for r in regions if self._region_class(r) in classes]
        if regions and regions[0].__class__.__name__ == 'PageType':
            regions = regions[1:]
        return regions
    # find more regions recursively
    more_regions = []
    for region in regions:
        more_regions.append([])
        for class_ in ['Advert', 'Chart', 'Chem', 'Custom', 'Graphic', 'Image',
                       'LineDrawing', 'Map', 'Maths', 'Music', 'Noise',
                       'Separator', 'Table', 'Text', 'Unknown']:
            if class_ == 'Map' and not isinstance(region, PageType): # pylint: disable=undefined-variable
                # 'Map' is not recursive in 2019 schema
                continue
            more_regions[-1] += getattr(region, 'get_{}Region'.format(class_))()
    if not any(more_regions):
        return self._get_recursive_regions(regions, 1, classes)
    ret = []
    for r, more in zip(regions, more_regions):
        ret.append(r)
        ret += self._get_recursive_regions(more, level - 1 if level else 0, classes)
    return self._get_recursive_regions(ret, 1, classes)

def _get_recursive_reading_order(self, rogroup):
    if isinstance(rogroup, (OrderedGroupType, OrderedGroupIndexedType)): # pylint: disable=undefined-variable
        elements = rogroup.get_AllIndexed()
    if isinstance(rogroup, (UnorderedGroupType, UnorderedGroupIndexedType)): # pylint: disable=undefined-variable
        elements = (rogroup.get_RegionRef() + rogroup.get_OrderedGroup() + rogroup.get_UnorderedGroup())
    regionrefs = list()
    for elem in elements:
        regionrefs.append(elem.get_regionRef())
        if not isinstance(elem, (RegionRefType, RegionRefIndexedType)): # pylint: disable=undefined-variable
            regionrefs.extend(self._get_recursive_reading_order(elem))
    return regionrefs

def get_AllRegions(self, classes=None, order='document', depth=0):
    """
    Get all the ``*Region`` elements, or only those provided by ``classes``.
    Returned in document order unless ``order`` is ``reading-order``.
    Arguments:
        classes (list) Classes of regions that shall be returned, e.g. ``['Text', 'Image']``
        order ("document"|"reading-order"|"reading-order-only") Whether to
            return regions sorted by document order (``document``, default) or by
            reading order with regions not in the reading order at the end of the
            returned list (``reading-order``) or regions not in the reading order
            omitted (``reading-order-only``)
        depth (int) Recursive depth to look for regions at, set to ``0`` for all regions at any depth. Default: 0

    For example, to get all text anywhere on the page in reading order, use:
    ::
        '\\n'.join(line.get_TextEquiv()[0].Unicode
                   for region in page.get_AllRegions(classes=['Text'], depth=0, order='reading-order')
                   for line in region.get_TextLine())
    """
    if order not in ['document', 'reading-order', 'reading-order-only']:
        raise Exception("Argument 'order' must be either 'document', 'reading-order' or 'reading-order-only', not '{}'".format(order))
    if depth < 0:
        raise Exception("Argument 'depth' must be an integer greater-or-equal 0, not '{}'".format(depth))
    ret = self._get_recursive_regions([self], depth + 1 if depth else 0, classes)
    if order.startswith('reading-order'):
        reading_order = self.get_ReadingOrder()
        if reading_order:
            reading_order = reading_order.get_OrderedGroup() or reading_order.get_UnorderedGroup()
        if reading_order:
            reading_order = self._get_recursive_reading_order(reading_order)
        if reading_order:
            id2region = {region.id: region for region in ret}
            in_reading_order = [id2region[region_id] for region_id in reading_order if region_id in id2region]
            # print("ret: {} / in_ro: {} / not-in-ro: {}".format(
            #     len(ret),
            #     len([id2region[region_id] for region_id in reading_order if region_id in id2region]),
            #     len([r for r in ret if r not in in_reading_order])
            #     ))
            if order == 'reading-order-only':
                ret = in_reading_order
            else:
                ret = in_reading_order + [r for r in ret if r not in in_reading_order]
    return ret
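
The interplay of `depth` and `order` is easier to follow on a toy tree. The sketch below is a simplified model, not the generated PAGE API: `Region`, `all_regions` and `apply_reading_order` are hypothetical names, regions have a single generic `children` list instead of one getter per region class, and class filtering is left out.

```python
class Region:
    """Hypothetical stand-in for a PAGE region: an id plus nested child regions."""

    def __init__(self, id_, children=()):
        self.id = id_
        self.children = list(children)


def all_regions(parent, depth=0, _level=1):
    """Collect nested regions in document order; depth=0 means no depth limit."""
    found = []
    for child in parent.children:
        found.append(child)
        if depth == 0 or _level < depth:
            found += all_regions(child, depth, _level + 1)
    return found


def apply_reading_order(regions, reading_order_ids, only=False):
    """Put regions named in the reading order first; drop the rest if only=True."""
    by_id = {region.id: region for region in regions}
    ordered = [by_id[id_] for id_ in reading_order_ids if id_ in by_id]
    if only:
        return ordered
    return ordered + [region for region in regions if region not in ordered]


page = Region('page', [Region('r1', [Region('r1a')]), Region('r2')])
document_order = [r.id for r in all_regions(page)]
top_level_only = [r.id for r in all_regions(page, depth=1)]
reading_order = [r.id for r in apply_reading_order(all_regions(page), ['r2', 'r1'])]
reading_order_only = [r.id for r in apply_reading_order(all_regions(page), ['r2', 'r1'], only=True)]
print(document_order, top_level_only, reading_order, reading_order_only)
```

Here `only=False` corresponds to `order='reading-order'` and `only=True` to `order='reading-order-only'`; the nested region `r1a` is not referenced by the reading order, so it is appended at the end in the first case and dropped in the second.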
7 changes: 7 additions & 0 deletions ocrd_models/ocrd_page_user_methods/get_UnorderedGroupChildren.py
@@ -0,0 +1,7 @@
def get_UnorderedGroupChildren(self):
    """
    List all non-metadata children of an UnorderedGroup
    """
    # TODO: should not change order
    return self.get_RegionRef() + self.get_OrderedGroup() + self.get_UnorderedGroup()

16 changes: 16 additions & 0 deletions ocrd_models/ocrd_page_user_methods/sort_AllIndexed.py
@@ -0,0 +1,16 @@
# pylint: disable=line-too-long,invalid-name,missing-module-docstring
def sort_AllIndexed(self, validate_uniqueness=True):
    """
    Sort all indexed children in-place.
    """
    elements = self.get_AllIndexed(index_sort=True)
    self.clear_AllIndexed()
    for element in elements:
        if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
            self.add_RegionRefIndexed(element)
        elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
            self.add_OrderedGroupIndexed(element)
        elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
            self.add_UnorderedGroupIndexed(element)
    return self.get_AllIndexed()

5 changes: 5 additions & 0 deletions tests/model/TEMP1_Gutachten2-2.xml
@@ -117,6 +117,11 @@
</pc:OrderedGroupIndexed>
<pc:RegionRefIndexed index="18" regionRef="Gutachten2-2_region0016"/>
<pc:RegionRefIndexed index="19" regionRef="Gutachten2-2_region0017"/>
<pc:UnorderedGroupIndexed id="unordered-group-for-testing_group" regionRef="unordered-group-for-testing" index="20">
<pc:RegionRef regionRef="unordered-group-for-testing_region0001"/>
</pc:UnorderedGroupIndexed>
<pc:UnorderedGroupIndexed id="empty-group-for-testing_group" regionRef="empty-group-for-testing" index="21">
</pc:UnorderedGroupIndexed>
</pc:OrderedGroup>
</pc:ReadingOrder>
<pc:TextRegion id="Gutachten2-2_region0001">