PageType.get_AllRegions to list all kinds of regions #479

Merged
merged 36 commits into from
Jun 4, 2020
Changes from 21 commits
abef352
PageType.get_AllRegions to list all kinds of regions
kba May 13, 2020
3445f87
Update ocrd_models/ocrd_page_user_methods.py
bertsky May 14, 2020
a48b8c1
update generateds page, add region filter if using reading order, wip
kba May 14, 2020
f51a2e4
Merge branch 'hotfix-ocrd-page-exports' into get-all-regions
kba May 14, 2020
8da3f3c
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 14, 2020
d2a01bb
refactoring: move generateDS methods to their own files
kba May 15, 2020
be7f026
get_AllRegions: adapt to signature proposed in #240, test with order=…
kba May 15, 2020
e1740f7
README: explain how to add user methods to PAGE API
kba May 15, 2020
6f9163e
Update ocrd_models/README.md
kba May 28, 2020
0c73b3e
Update ocrd_models/README.md
kba May 28, 2020
5c2f3a8
Update ocrd_models/README.md
kba May 28, 2020
6a57506
recursion (with both finite or arbitrary depth) for get_AllRegions
kba May 28, 2020
a9072c8
regenerate PAGE API
kba May 28, 2020
ac62b85
get_AllRegions: clean-up merge artifacts and reorganize
kba May 28, 2020
fd6d545
Update ocrd_models/ocrd_page_user_methods/get_AllRegions.py
kba May 28, 2020
86a7133
get_AllRegions: _region_id method unneccessary now
kba May 28, 2020
ce06392
Merge branch 'get-all-regions' of https://github.com/kba/ocrd-core in…
kba May 28, 2020
5c8d89b
regenerate PAGE API
kba May 28, 2020
f6e3da5
:art: pylint
kba May 28, 2020
8351056
add_AllIndexed -> extend_AllIndexed
kba May 28, 2020
f202205
get_AllRegions: differentiate "reading-order"/"reading-order-only"
kba May 28, 2020
ffba6f9
get_AllRegions: catch negative depth, test depth==0
kba May 29, 2020
207f396
:memo: get_AllRegions: document example
bertsky May 29, 2020
9ced315
get_AllRegions: fix recursion
kba May 29, 2020
629f38d
get_AllRegions: Update example
kba May 29, 2020
e958559
wip
kba May 29, 2020
1964563
reading order test sample: add unorderedgroups for testing
kba May 29, 2020
27e256f
add get_UnorderedGroupChildren, let get_AllIndexed handle UnorderedGr…
kba May 29, 2020
1b17e3f
get_AllIndexed: allow filtering by child type
kba May 29, 2020
ae613cf
get_AllIndexed: index_sort parameter to enable/disable sorting
kba May 29, 2020
b1df95f
add sort_AllIndexed to sort in-place
kba May 29, 2020
fd9dc83
extend_AllIndexed: increment @index when adding elements
kba May 29, 2020
9d0e539
Merge branch 'master' into get-all-regions
kba May 29, 2020
84f1d33
:memo: changelog
kba May 29, 2020
0e14633
Document extend_AllIndexed validate_contiunuity param
kba Jun 3, 2020
b79474a
Merge branch 'master' into get-all-regions
kba Jun 4, 2020
31 changes: 31 additions & 0 deletions ocrd_models/README.md
@@ -3,3 +3,34 @@
> OCR-D framework - file format APIs and schemas

See https://github.com/OCR-D/core

## Adding user methods to the generated PAGE API

Let's say you want to add a method `get_FirstTextRegion` on the pc:Page element:

1. Create a file `ocrd_models/ocrd_page_user_methods/get_FirstTextRegion.py`

```python
def get_FirstTextRegion(self):
    # get_TextRegion is a getter method and must be called before indexing
    return self.get_TextRegion()[0]
```

**NOTE** The method name and file name must be identical.

**NOTE** Do not use Python's `%` string interpolation operator; it will break generateDS. Use `"".format(...)` instead.

2. Edit `ocrd_models/ocrd_page_user_methods.py` and add an entry to the `METHOD_SPECS` tuple:

```python
METHOD_SPECS = (
    # ...
    _add_method(r'^PageType$', 'get_FirstTextRegion'),
    # ...
)
```

3. Regenerate the PAGE API:

```sh
make generate-page
```
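
The first argument to `_add_method` above is a regular expression. The sketch below illustrates why the `^` and `$` anchors matter when picking the class that receives a method. It assumes the pattern is matched against the generated class name; `GENERATED_CLASSES` and `classes_matching` are hypothetical names used only for illustration and are not part of generateDS or OCR-D.

```python
import re

# Hypothetical sample of class names as they might appear in the generated PAGE API.
GENERATED_CLASSES = ['PcGtsType', 'PageType', 'TextRegionType', 'OrderedGroupType']


def classes_matching(pattern, class_names):
    """Return the class names selected by ``pattern`` (assumed to be a regex on the name)."""
    regex = re.compile(pattern)
    return [name for name in class_names if regex.search(name)]


# Anchored at both ends: selects exactly one class.
anchored = classes_matching(r'^PageType$', GENERATED_CLASSES)

# Anchored only at the end: selects every class whose name ends in 'Type'.
suffix_only = classes_matching(r'Type$', GENERATED_CLASSES)

print(anchored)
print(suffix_only)
```

With a fully anchored pattern, the method is attached to a single class only, which is usually what a method like `get_FirstTextRegion` needs.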
186 changes: 131 additions & 55 deletions ocrd_models/ocrd_models/ocrd_page_generateds.py
@@ -2,8 +2,8 @@
 # -*- coding: utf-8 -*-

 #
-# Generated Wed May 13 16:09:07 2020 by generateDS.py version 2.35.20.
-# Python 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
+# Generated Thu May 28 20:28:48 2020 by generateDS.py version 2.35.20.
+# Python 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
 #
 # Command line options:
 # ('-f', '')
@@ -16,7 +16,7 @@
 # repo/assets/data/schema/data/2019.xsd
 #
 # Command line:
-# /home/kba/miniconda3/bin/generateDS -f --root-element="PcGts" -o "ocrd_models/ocrd_models/ocrd_page_generateds.py" --disable-generatedssuper-lookup --user-methods="ocrd_models/ocrd_page_user_methods.py" repo/assets/data/schema/data/2019.xsd
+# /home/kba/ocrd_all/venv/bin/generateDS -f --root-element="PcGts" -o "ocrd_models/ocrd_models/ocrd_page_generateds.py" --disable-generatedssuper-lookup --user-methods="ocrd_models/ocrd_page_user_methods.py" repo/assets/data/schema/data/2019.xsd
 #
 # Current working directory (os.getcwd()):
 # core
@@ -2850,6 +2850,80 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
            obj_.original_tagname_ = 'CustomRegion'
    def __hash__(self):
        return hash(self.id)
    # pylint: disable=line-too-long,invalid-name,protected-access,missing-module-docstring
    def _region_class(self, x): # pylint: disable=unused-argument
        return x.__class__.__name__.replace('RegionType', '')

    def _get_recursive_regions(self, regions, level, classes=None):
        if level == 1:
            # stop recursion, filter classes
            if classes:
                return [r for r in regions if self._region_class(r) in classes]
            # remove the first element (PageType)
            # NOTE: set() removes duplicates but does not preserve document order
            return list(set(regions[1:]))
        # find more regions recursively
        more_regions = []
        for region in regions:
            more_regions.append([])
            for class_ in ['Advert', 'Chart', 'Chem', 'Custom', 'Graphic', 'Image',
                           'LineDrawing', 'Map', 'Maths', 'Music', 'Noise',
                           'Separator', 'Table', 'Text', 'Unknown']:
                if class_ == 'Map' and not isinstance(region, PageType): # pylint: disable=undefined-variable
                    # 'Map' is not recursive in 2019 schema
                    continue
                more_regions[-1] += getattr(region, 'get_{}Region'.format(class_))()
        if not any(more_regions):
            # must be called as a method; a bare function name would raise NameError
            return self._get_recursive_regions(regions, 1, classes)
        regions = [region for r, more in zip(regions, more_regions) for region in [r] + more]
        return self._get_recursive_regions(regions, level - 1 if level else 0, classes)

    def _get_recursive_reading_order(self, rogroup):
        if isinstance(rogroup, (OrderedGroupType, OrderedGroupIndexedType)): # pylint: disable=undefined-variable
            elements = rogroup.get_AllIndexed()
        if isinstance(rogroup, (UnorderedGroupType, UnorderedGroupIndexedType)): # pylint: disable=undefined-variable
            elements = (rogroup.get_RegionRef() + rogroup.get_OrderedGroup() + rogroup.get_UnorderedGroup())
        regionrefs = list()
        for elem in elements:
            regionrefs.append(elem.get_regionRef())
            if not isinstance(elem, (RegionRefType, RegionRefIndexedType)): # pylint: disable=undefined-variable
                regionrefs.extend(self._get_recursive_reading_order(elem))
        return regionrefs

    def get_AllRegions(self, classes=None, order='document', depth=1):
        """
        Get all the *Region elements, or only those whose class is listed in ``classes``.
        Returned in document order unless ``order`` is ``reading-order`` or ``reading-order-only``.
        Arguments:
            classes (list) Classes of regions that shall be returned, e.g. ``['Text', 'Image']``
            order ("document"|"reading-order"|"reading-order-only") Whether to
                return regions sorted by document order (``document``, default), by
                reading order with regions not in the reading order at the end of the
                returned list (``reading-order``), or by reading order with regions
                not in the reading order omitted (``reading-order-only``)
            depth (int) Recursive depth to look for regions at. Default: 1
        """
        if order not in ['document', 'reading-order', 'reading-order-only']:
            raise Exception("Argument 'order' must be either 'document', 'reading-order' or 'reading-order-only', not '{}'".format(order))
        ret = self._get_recursive_regions([self], depth + 1, classes)
        if order.startswith('reading-order'):
            reading_order = self.get_ReadingOrder()
            if reading_order:
                reading_order = reading_order.get_OrderedGroup() or reading_order.get_UnorderedGroup()
            if reading_order:
                reading_order = self._get_recursive_reading_order(reading_order)
            if reading_order:
                id2region = {region.id: region for region in ret}
                in_reading_order = [id2region[region_id] for region_id in reading_order if region_id in id2region]
                # print("ret: {} / in_ro: {} / not-in-ro: {}".format(
                #     len(ret),
                #     len([id2region[region_id] for region_id in reading_order if region_id in id2region]),
                #     len([r for r in ret if r not in in_reading_order])
                #     ))
                if order == 'reading-order-only':
                    ret = in_reading_order
                else:
                    ret = in_reading_order + [r for r in ret if r not in in_reading_order]
        return ret
# end class PageType
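
The three values of the `order` argument of `get_AllRegions` can be illustrated in isolation. The following is a minimal sketch, not the generated code: `Region` and `order_regions` are hypothetical stand-ins, and the reading order is assumed to be already flattened into a list of region IDs.

```python
from collections import namedtuple

# Hypothetical stand-in for a PAGE region; only the ``id`` attribute matters here.
Region = namedtuple('Region', ['id'])


def order_regions(regions, reading_order_ids, order='document'):
    """Reorder ``regions`` (given in document order) according to ``order``."""
    if order not in ('document', 'reading-order', 'reading-order-only'):
        raise ValueError("unknown order: {}".format(order))
    if order == 'document':
        return list(regions)
    id2region = {region.id: region for region in regions}
    # Regions referenced by the reading order, in that order; unknown IDs are skipped.
    in_reading_order = [id2region[i] for i in reading_order_ids if i in id2region]
    if order == 'reading-order-only':
        return in_reading_order
    seen = {region.id for region in in_reading_order}
    # Regions the reading order does not mention follow at the end, in document order.
    return in_reading_order + [r for r in regions if r.id not in seen]


regions = [Region('r1'), Region('r2'), Region('r3')]
reading_order_ids = ['r3', 'r1', 'r9']  # 'r9' does not exist on this page

by_document = [r.id for r in order_regions(regions, reading_order_ids, 'document')]
by_reading_order = [r.id for r in order_regions(regions, reading_order_ids, 'reading-order')]
reading_order_only = [r.id for r in order_regions(regions, reading_order_ids, 'reading-order-only')]
```

Here `r2` is not part of the reading order, so it is appended last for `reading-order` and dropped for `reading-order-only`.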


@@ -5347,9 +5421,31 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroupIndexed'
     def __hash__(self):
         return hash(self.id)
-
-    def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True):
-        eol_ = '\n' if pretty_print else ''
+    def get_AllIndexed(self):
+        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
+
+    def clear_AllIndexed(self):
+        ret = self.get_AllIndexed()
+        self.set_RegionRefIndexed([])
+        self.set_OrderedGroupIndexed([])
+        self.set_UnorderedGroupIndexed([])
+        return ret
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
+    def extend_AllIndexed(self, elements):
+        if not isinstance(elements, list):
+            elements = [elements]
+        for element in sorted(elements, key=lambda x: x.index):
+            if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
+                self.add_RegionRefIndexed(element)
+            elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_OrderedGroupIndexed(element)
+            elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_UnorderedGroupIndexed(element)
+        return self.get_AllIndexed()
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
+    def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True): # pylint: disable=unused-argument,too-many-arguments
         namespaceprefix_ = 'pc:'
         if self.UserDefined is not None:
             self.UserDefined.export(outfile, level, namespaceprefix_, namespacedef_='', name_='UserDefined', pretty_print=pretty_print)
@@ -5358,36 +5454,15 @@ def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xml
         cleaned = []
         # remove empty groups and replace with RegionRefIndexedType
         for entry in self.get_AllIndexed():
-            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed():
-                rri = RegionRefIndexedType.factory(parent_object_=self)
+            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed(): # pylint: disable=undefined-variable
+                rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
                 rri.index = entry.index
                 rri.regionRef = entry.regionRef
                 cleaned.append(rri)
             else:
                 cleaned.append(entry)
         for entry in cleaned:
             entry.export(outfile, level, namespaceprefix_, namespacedef_='', name_=entry.__class__.__name__[:-4], pretty_print=pretty_print)
-
-    def get_AllIndexed(self):
-        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
-    def add_AllIndexed(self, elements):
-        if not isinstance(elements, list):
-            elements = [elements]
-        for element in sorted(elements, key=lambda x : x.index):
-            if isinstance(element, RegionRefIndexedType):
-                self.add_RegionRefIndexed(element)
-            elif isinstance(element, OrderedGroupIndexedType):
-                self.add_OrderedGroupIndexed(element)
-            elif isinstance(element, UnorderedGroupIndexedType):
-                self.add_UnorderedGroupIndexed(element)
-        return self.get_AllIndexed()
-
-    def clear_AllIndexed(self):
-        ret = self.get_AllIndexed()
-        self.set_RegionRefIndexed([])
-        self.set_OrderedGroupIndexed([])
-        self.set_UnorderedGroupIndexed([])
-        return ret
# end class OrderedGroupIndexedType
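
The `*_AllIndexed` helpers treat several per-type child lists as one sequence ordered by `index`. The sketch below shows that idea with plain Python objects. `RegionRef`, `Group` and `IndexedGroup` are hypothetical stand-ins for the generated classes, and their method names are simplified for illustration.

```python
class RegionRef:
    """Hypothetical stand-in for RegionRefIndexedType."""
    def __init__(self, index):
        self.index = index


class Group:
    """Hypothetical stand-in for the indexed group types."""
    def __init__(self, index):
        self.index = index


class IndexedGroup:
    """Keeps children in separate per-type lists, as the generated class does."""
    def __init__(self):
        self.region_refs = []
        self.groups = []

    def get_all_indexed(self):
        # Merge the per-type lists and sort them by their index attribute.
        return sorted(self.region_refs + self.groups, key=lambda x: x.index)

    def extend_all_indexed(self, elements):
        # Accept a single element as well as a list of elements.
        if not isinstance(elements, list):
            elements = [elements]
        for element in elements:
            if isinstance(element, RegionRef):
                self.region_refs.append(element)
            elif isinstance(element, Group):
                self.groups.append(element)
        return self.get_all_indexed()

    def clear_all_indexed(self):
        # Return the former children and leave the group empty.
        ret = self.get_all_indexed()
        self.region_refs = []
        self.groups = []
        return ret


group = IndexedGroup()
merged = group.extend_all_indexed([Group(1), RegionRef(2), RegionRef(0)])
merged_indexes = [element.index for element in merged]
removed = group.clear_all_indexed()
```

Because the children live in separate lists, the merged view has to be sorted on every read, which is what `get_AllIndexed` does with `key=lambda x : x.index`.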


@@ -6136,9 +6211,31 @@ def buildChildren(self, child_, node, nodeName_, fromsubclass_=False, gds_collec
             obj_.original_tagname_ = 'UnorderedGroupIndexed'
     def __hash__(self):
         return hash(self.id)
-
-    def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True):
-        eol_ = '\n' if pretty_print else ''
+    def get_AllIndexed(self):
+        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
+
+    def clear_AllIndexed(self):
+        ret = self.get_AllIndexed()
+        self.set_RegionRefIndexed([])
+        self.set_OrderedGroupIndexed([])
+        self.set_UnorderedGroupIndexed([])
+        return ret
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
+    def extend_AllIndexed(self, elements):
+        if not isinstance(elements, list):
+            elements = [elements]
+        for element in sorted(elements, key=lambda x: x.index):
+            if isinstance(element, RegionRefIndexedType): # pylint: disable=undefined-variable
+                self.add_RegionRefIndexed(element)
+            elif isinstance(element, OrderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_OrderedGroupIndexed(element)
+            elif isinstance(element, UnorderedGroupIndexedType): # pylint: disable=undefined-variable
+                self.add_UnorderedGroupIndexed(element)
+        return self.get_AllIndexed()
+
+    # pylint: disable=line-too-long,invalid-name,missing-module-docstring,missing-function-docstring
+    def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"', name_='OrderedGroupType', fromsubclass_=False, pretty_print=True): # pylint: disable=unused-argument,too-many-arguments
         namespaceprefix_ = 'pc:'
         if self.UserDefined is not None:
             self.UserDefined.export(outfile, level, namespaceprefix_, namespacedef_='', name_='UserDefined', pretty_print=pretty_print)
@@ -6147,36 +6244,15 @@ def exportChildren(self, outfile, level, namespaceprefix_='', namespacedef_='xml
         cleaned = []
         # remove empty groups and replace with RegionRefIndexedType
         for entry in self.get_AllIndexed():
-            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed():
-                rri = RegionRefIndexedType.factory(parent_object_=self)
+            if isinstance(entry, (UnorderedGroupIndexedType, OrderedGroupIndexedType)) and not entry.get_AllIndexed(): # pylint: disable=undefined-variable
+                rri = RegionRefIndexedType.factory(parent_object_=self) # pylint: disable=undefined-variable
                 rri.index = entry.index
                 rri.regionRef = entry.regionRef
                 cleaned.append(rri)
             else:
                 cleaned.append(entry)
         for entry in cleaned:
             entry.export(outfile, level, namespaceprefix_, namespacedef_='', name_=entry.__class__.__name__[:-4], pretty_print=pretty_print)
-
-    def get_AllIndexed(self):
-        return sorted(self.get_RegionRefIndexed() + self.get_OrderedGroupIndexed() + self.get_UnorderedGroupIndexed(), key=lambda x : x.index)
-    def add_AllIndexed(self, elements):
-        if not isinstance(elements, list):
-            elements = [elements]
-        for element in sorted(elements, key=lambda x : x.index):
-            if isinstance(element, RegionRefIndexedType):
-                self.add_RegionRefIndexed(element)
-            elif isinstance(element, OrderedGroupIndexedType):
-                self.add_OrderedGroupIndexed(element)
-            elif isinstance(element, UnorderedGroupIndexedType):
-                self.add_UnorderedGroupIndexed(element)
-        return self.get_AllIndexed()
-
-    def clear_AllIndexed(self):
-        ret = self.get_AllIndexed()
-        self.set_RegionRefIndexed([])
-        self.set_OrderedGroupIndexed([])
-        self.set_UnorderedGroupIndexed([])
-        return ret
# end class OrderedGroupType

