Processor resource discovery #559

Merged
merged 74 commits into from
Jan 26, 2021
Changes from 69 commits
Commits
2257876
utils: implement resource lookup logic
kba Aug 10, 2020
a7b8001
utils: list_all_resources to list all processor resources
kba Aug 10, 2020
7e362d7
ocrd_utils.constants: XDG_CACHE_HOME
kba Aug 10, 2020
78e84a2
list_all_resources: also look in XDG_CACHE_HOME
kba Aug 10, 2020
5c75f40
Processor: implement resolve_resource and list_all_resources
kba Aug 10, 2020
bb63210
resolve_resource: also look in XDG_CACHE_HOME
kba Aug 10, 2020
297a0f3
Processor: fix signature for list_resource_candidates
kba Aug 10, 2020
c999229
initial test of list_resource_candidates
kba Aug 25, 2020
479cedd
test_os: not all test environments have VIRTUAL_ENV set
kba Oct 13, 2020
247bf4d
Merge branch 'master' into resolve-files
kba Oct 19, 2020
8e0b595
Merge branch 'master' into resolve-files
kba Oct 27, 2020
0d97c2d
wip
kba Oct 27, 2020
a54b5bf
Merge branch 'master' into resolve-files
kba Dec 11, 2020
6e76653
fixes merge error {f,}chmod
kba Dec 11, 2020
afcd117
run non-logging unit tests with standard $HOME
kba Dec 15, 2020
a3226b1
implement -C/-L cmdline flags
kba Dec 16, 2020
c2e0460
schema for resource list
kba Dec 21, 2020
5a6ccf3
implement foundation of ocrd resmgr
kba Dec 21, 2020
128f6b7
ocrd resmgr list-{installed,available} same output
kba Dec 21, 2020
0f42da6
resmgr: basic downloading of urls of files
kba Dec 22, 2020
b63b4d8
resmgr: support parameter_usage different from resource name
kba Dec 22, 2020
3f0eeac
add more models to resource_list.yml
kba Dec 22, 2020
6e6e424
resmgr: simplify resource typing
kba Dec 22, 2020
5843e1b
resmgr: support tarball downloads
kba Dec 22, 2020
0edac70
search for resources only on top-level
kba Dec 22, 2020
f96ce5e
use resmgr in Processor.resolve_resource
kba Dec 22, 2020
fa90f1b
simplify Processor.resolve_resource, delegate to resmgr as much as po…
kba Dec 22, 2020
89f77f0
resmgr: add anybaseocr resources
kba Dec 23, 2020
18009a7
resmgr download: show progressbar, add size to resource list
kba Dec 23, 2020
2df4c22
fix resmgr test
kba Dec 23, 2020
e8d0e0f
resmgr download: * to download all resources for this model
kba Dec 23, 2020
1fd35b9
:package: pre-release 2.22.0b1
kba Dec 28, 2020
849de10
new PAGE XML user method get_AllTextLine
kba Dec 30, 2020
bf47a07
update assets
kba Dec 30, 2020
db36dd3
kraken resources
kba Dec 30, 2020
a571b82
:package: pre-release 2.22.0b2
kba Dec 30, 2020
3aa60a8
reslist: use name w/o slash
kba Dec 30, 2020
4bf12fb
:package: pre-release v2.22.0b3
kba Dec 31, 2020
2c26eb0
Update ocrd/ocrd/processor/base.py
kba Jan 4, 2021
02e6415
rename PAGE method get_AllTextLine{,s}
kba Jan 4, 2021
4def1d9
OcrdPage.get_AllTextLines: support region_order, stub for textline_order
kba Jan 4, 2021
7911603
ocrd resmgr list-installed: look in fs for candidates
kba Jan 5, 2021
54a214a
resource_list.yml: typo: ocrd{,-cis}-ocropy-recognize
kba Jan 5, 2021
e33346b
resmgr list-installed: create stub in user resource list for unregist…
kba Jan 5, 2021
3ee66ce
resmgr: use last URL segment as the resource name
kba Jan 5, 2021
b21d462
resmgr: unquote URL encoded path
kba Jan 5, 2021
d8d97af
resmgr: use GET instead of HEAD for content-length
kba Jan 6, 2021
509200c
resmgr: support "download" (=copying) of local files
kba Jan 7, 2021
cbbc09a
resmgr, introduce intermediary "ocrd-resource" dir
kba Jan 12, 2021
7b1b6c9
default to VIRTUAL_ENV sharedir
kba Jan 12, 2021
565ba38
resmgr: save stub on download
kba Jan 12, 2021
012e49e
get_AllTextLines: implement textlineOrder
bertsky Jan 12, 2021
199b430
resmgr: ocrd-resources also for list_resource_candidates
kba Jan 12, 2021
8349807
resmgr: add @stweil's ONB model to list
kba Jan 14, 2021
7840b5b
resmgr: when wildcard downloading, omit ??? user entries
kba Jan 18, 2021
5038005
add a config file $XDG_CONFIG_HOME/ocrd.yml
kba Jan 19, 2021
2ab2151
ocrd resmgr: use resource_location from config for default
kba Jan 19, 2021
4687886
config: merge with default config for updated config
kba Jan 19, 2021
032929e
move config file to $XDG_CONFIG_HOME/ocrd/config.yml for consistency
kba Jan 19, 2021
c6a53b0
resource manager: methods to resolve resource dirs
kba Jan 19, 2021
a741a72
Merge branch 'master' into resolve-files
kba Jan 20, 2021
53a591d
:package: v2.22.0b4
kba Jan 20, 2021
a05ecf4
fix ocrd_config test
kba Jan 20, 2021
61a8845
config: mkdir -p $(basename)
kba Jan 20, 2021
fd8ca26
:bug: resmgr: virtualenv location was missing "share"
kba Jan 21, 2021
a3cff9e
resmgr: show shorthand location in list-installed
kba Jan 21, 2021
9280ef4
remove virtualenv, introduce /usr/local/share
kba Jan 22, 2021
9cb058a
:fire: remove configuration file
kba Jan 22, 2021
7e26a07
resmgr: lookup in XDG_DATA_HOME and absolute path only
kba Jan 22, 2021
22fb2c6
resmgr download: be stricter about uninstalled processors
kba Jan 25, 2021
2b3cb64
resmgr download "*"
kba Jan 25, 2021
ac74c3d
Update ocrd/ocrd/resource_manager.py
kba Jan 25, 2021
134a0c1
Update ocrd/ocrd/resource_manager.py
kba Jan 25, 2021
a5858ec
allow "from ocrd import OcrdResourceManager"
kba Jan 25, 2021
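
Commits 3ee66ce and b21d462 above describe deriving a resource name from the last URL segment and unquoting URL-encoded paths. The following is a minimal sketch of that idea using only the standard library. The helper name `resource_name_from_url` and the example URL are assumptions made for illustration; they are not taken from the PR, and the actual implementation may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def resource_name_from_url(url):
    """Return the last path segment of ``url`` with percent-escapes decoded.

    Hypothetical helper for illustration, not the function used in the PR.
    """
    # urlparse separates the path from the query string and fragment
    path = urlparse(url).path
    # Decode escapes such as %20, then keep only the final path segment
    return PurePosixPath(unquote(path)).name


# Hypothetical URL, used only to exercise the helper
print(resource_name_from_url("https://example.org/models/my%20model.traineddata?raw=true"))
```

Dropping the query string before taking the last segment keeps parameters such as `?raw=true` out of the resulting file name.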
29 changes: 26 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,19 +5,38 @@ Versioned according to [Semantic Versioning](http://semver.org/).

## Unreleased

Fixed:

* `run_cli`: don't reference undefined vars in error handler, #651
## [2.22.0b4] - 2021-01-20

Added:

* Implement file resource algorithm from OCR-D/spec#169, #559
* New CLI `ocrd resmgr` to download/browse processor resources, #559
* `Workspace.rename_file_group` with CLI `ocrd workspace rename-group` to rename file groups, #646

Changed:

* `ocrd workspace add`: guess `--mimetype` if not provided, #658
* `ocrd workspace add`: warn if `--page-id` not provided, #659

## [2.22.0b3] - 2020-12-30

Fixed:
* `name` of resources mustn't contain slash `/`

## [2.22.0b2] - 2020-12-30

Added:

* PAGE API method `get_AllTextLines`
* resources for kraken

## [2.22.0b1] - 2020-12-28

Fixed:

* `run_cli`: don't reference undefined vars in error handler, #651


## [2.21.0] - 2020-11-27

Changed:
Expand Down Expand Up @@ -1249,6 +1268,10 @@ Fixed
Initial Release

<!-- link-labels -->
[2.22.0b4]: ../../compare/v2.22.0b4..v2.22.0b3
[2.22.0b3]: ../../compare/v2.22.0b3..v2.22.0b2
[2.22.0b2]: ../../compare/v2.22.0b2..v2.22.0b1
[2.22.0b1]: ../../compare/v2.22.0b1..v2.21.0
[2.21.0]: ../../compare/v2.21.0..v2.20.2
[2.20.2]: ../../compare/v2.20.2..v2.20.1
[2.20.1]: ../../compare/v2.20.1..v2.20.0
Expand Down
6 changes: 3 additions & 3 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -64,12 +64,12 @@ deps-ubuntu:

# Install test python deps via pip
deps-test:
$(PIP) install -U "pip>=19.0.0"
$(PIP) install -U "pip>=19.0.0,!=20.3.2"
$(PIP) install -r requirements_test.txt

# (Re)install the tool
install:
$(PIP) install -U "pip>=19.0.0" wheel
$(PIP) install -U "pip>=19.0.0,!=20.3.2" wheel
for mod in $(BUILD_ORDER);do (cd $$mod ; $(PIP_INSTALL) .);done

# Install with pip install -e
Expand Down Expand Up @@ -148,7 +148,7 @@ assets-server:
test: assets
HOME=$(CURDIR)/ocrd_utils $(PYTHON) -m pytest --continue-on-collection-errors -k TestLogging $(TESTDIR)
HOME=$(CURDIR) $(PYTHON) -m pytest --continue-on-collection-errors -k TestLogging $(TESTDIR)
HOME=$(CURDIR) $(PYTHON) -m pytest --continue-on-collection-errors --ignore=$(TESTDIR)/test_logging.py $(TESTDIR)
$(PYTHON) -m pytest --continue-on-collection-errors --ignore=$(TESTDIR)/test_logging.py $(TESTDIR)

test-profile:
$(PYTHON) -m cProfile -o profile $$(which pytest)
Expand Down
2 changes: 2 additions & 0 deletions ocrd/ocrd/cli/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ def get_help(self, ctx):
from ocrd.cli.process import process_cli
from ocrd.cli.bashlib import bashlib_cli
from ocrd.cli.validate import validate_cli
from ocrd.cli.resmgr import resmgr_cli
from ocrd.decorators import ocrd_loglevel
from .zip import zip_cli
from .log import log_cli
Expand All @@ -37,3 +38,4 @@ def cli(**kwargs): # pylint: disable=unused-argument
cli.add_command(zip_cli)
cli.add_command(validate_cli)
cli.add_command(log_cli)
cli.add_command(resmgr_cli)
124 changes: 124 additions & 0 deletions ocrd/ocrd/cli/resmgr.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
import sys
from os import getcwd
from os.path import join
from pathlib import Path
import requests

import click

from ocrd_utils import (
initLogging,
getLogger,
RESOURCE_LOCATIONS
)
from ocrd_validators import OcrdZipValidator

from ..resource_manager import OcrdResourceManager

def print_resources(executable, reslist, resmgr):
print('%s' % executable)
for resdict in reslist:
print('- %s %s (%s)\n %s' % (
resdict['name'],
'@ %s' % resmgr.resource_dir_to_location(resdict['path']) if 'path' in resdict else '',
resdict['url'],
resdict['description']
))
print()

@click.group("resmgr")
def resmgr_cli():
"""
Managing processor resources
"""
initLogging()

@resmgr_cli.command('list-available')
@click.option('-e', '--executable', help='Show only resources for executable EXEC', metavar='EXEC')
def list_available(executable=None):
"""
List available resources
"""
resmgr = OcrdResourceManager()
for executable, reslist in resmgr.list_available(executable):
print_resources(executable, reslist, resmgr)

@resmgr_cli.command('list-installed')
@click.option('-e', '--executable', help='Show only resources for executable EXEC', metavar='EXEC')
def list_installed(executable=None):
"""
List installed resources
"""
resmgr = OcrdResourceManager()
ret = []
for executable, reslist in resmgr.list_installed(executable):
print_resources(executable, reslist, resmgr)

@resmgr_cli.command('download')
@click.option('-n', '--any-url', help='Allow downloading/copying unregistered resources', is_flag=True)
@click.option('-o', '--overwrite', help='Overwrite existing resources', is_flag=True)
@click.option('-l', '--location', help='Where to store resources', type=click.Choice(RESOURCE_LOCATIONS), default='data', show_default=True)
@click.argument('executable', required=True)
@click.argument('url_or_name', required=True)
def download(any_url, overwrite, location, executable, url_or_name):
"""
Download resource URL_OR_NAME for processor EXECUTABLE.

URL_OR_NAME can either be the ``name`` or ``url`` of a registered resource.

If URL_OR_NAME is '*' (asterisk), download all known resources for this processor.

If ``--any-url`` is given, also accepts URL or filenames of non-registered resources for ``URL_OR_NAME``.
"""
log = getLogger('ocrd.cli.resmgr')
resmgr = OcrdResourceManager()
basedir = resmgr.location_to_resource_dir(location)
is_url = url_or_name.startswith('https://') or url_or_name.startswith('http://')
is_filename = Path(url_or_name).exists()
find_kwargs = {'executable': executable}
if url_or_name != '*':
find_kwargs['url' if is_url else 'name'] = url_or_name
reslist = resmgr.find_resources(**find_kwargs)
if not reslist:
log.info("No resources found in registry")
if any_url and (is_url or is_filename):
log.info("%s unregistered resource %s" % ("Downloading" if is_url else "Copying", url_or_name))
if is_url:
with requests.get(url_or_name, stream=True) as r:
content_length = int(r.headers.get('content-length'))
else:
url_or_name = str(Path(url_or_name).resolve())
content_length = Path(url_or_name).stat().st_size
with click.progressbar(length=content_length, label="Downloading" if is_url else "Copying") as bar:
fpath = resmgr.download(
executable,
url_or_name,
overwrite=overwrite,
basedir=basedir,
progress_cb=lambda delta: bar.update(delta))
log.info("%s resource '%s' (%s) not a known resource, creating stub in %s" % (executable, fpath.name, url_or_name, resmgr.user_list))
resmgr.add_to_user_database(executable, fpath, url_or_name)
log.info("%s %s to %s" % ("Downloaded" if is_url else "Copied", url_or_name, fpath))
log.info("Use in parameters as '%s'" % fpath.name)
else:
sys.exit(1)
else:
for _, resdict in reslist:
if resdict['url'] == '???':
log.info("Cannot download user resource %s" % (resdict['name']))
continue
log.info("Downloading resource %s" % resdict)
with click.progressbar(length=resdict['size']) as bar:
fpath = resmgr.download(
executable,
resdict['url'],
name=resdict['name'],
resource_type=resdict['type'],
path_in_archive=resdict.get('path_in_archive', '.'),
overwrite=overwrite,
basedir=basedir,
progress_cb=lambda delta: bar.update(delta)
)
log.info("Downloaded %s to %s" % (resdict['url'], fpath))
log.info("Use in parameters as '%s'" % resmgr.parameter_usage(resdict['name'], usage=resdict['parameter_usage']))
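
The `download` command above decides between three kinds of argument (a URL, a local file, or a registered name) and treats `*` as "no filter". A minimal sketch of just that decision, pulled out of the command so it can be tried in isolation, might look as follows. The function name `build_find_kwargs` and the processor name `ocrd-example` are hypothetical; the PR builds the dictionary inline.

```python
from pathlib import Path


def build_find_kwargs(executable, url_or_name):
    """Mirror how ``download`` turns its arguments into registry filters.

    Hypothetical helper for illustration; the PR does this inline.
    """
    # Only http(s) URLs count as URLs, as in the command above
    is_url = url_or_name.startswith(('https://', 'http://'))
    # Anything that exists on disk can be "downloaded" by copying
    is_filename = Path(url_or_name).exists()
    find_kwargs = {'executable': executable}
    # '*' means "all resources of this processor", so no extra filter is added
    if url_or_name != '*':
        find_kwargs['url' if is_url else 'name'] = url_or_name
    return find_kwargs, is_url, is_filename


# 'ocrd-example' is a made-up processor name
for arg in ('*', 'https://example.org/model.bin', 'some-model'):
    print(arg, build_find_kwargs('ocrd-example', arg))
```

Note that `is_filename` does not influence the registry query; in the command it only matters once the registry lookup comes back empty and `--any-url` is set.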

4 changes: 4 additions & 0 deletions ocrd/ocrd/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,16 @@
'DOWNLOAD_DIR',
'DEFAULT_REPOSITORY_URL',
'BASHLIB_FILENAME',
'RESOURCE_LIST_FILENAME',
'BACKUP_DIR',
'RESOURCE_USER_LIST_COMMENT',
]

TMP_PREFIX = 'ocrd-core-'
DEFAULT_UPLOAD_FOLDER = '/tmp/uploads-ocrd-core'
DOWNLOAD_DIR = '/tmp/ocrd-core-downloads'
DEFAULT_REPOSITORY_URL = 'http://localhost:5000/'
BASHLIB_FILENAME = resource_filename(__name__, 'lib.bash')
RESOURCE_LIST_FILENAME = resource_filename(__name__, 'resource_list.yml')
RESOURCE_USER_LIST_COMMENT = "# OCR-D private resource list (consider sending a PR with your own resources to OCR-D/core)"
BACKUP_DIR = '.backup'
13 changes: 11 additions & 2 deletions ocrd/ocrd/decorators/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,13 +28,22 @@ def ocrd_cli_wrap_processor(
help=False, # pylint: disable=redefined-builtin
version=False,
overwrite=False,
show_resource=None,
list_resources=False,
**kwargs
):
if not sys.argv[1:]:
processorClass(workspace=None, show_help=True)
sys.exit(1)
if dump_json or help or version:
processorClass(workspace=None, dump_json=dump_json, show_help=help, show_version=version)
if dump_json or help or version or show_resource or list_resources:
processorClass(
workspace=None,
dump_json=dump_json,
show_help=help,
show_version=version,
show_resource=show_resource,
list_resources=list_resources
)
sys.exit()
else:
initLogging()
Expand Down
3 changes: 3 additions & 0 deletions ocrd/ocrd/decorators/ocrd_cli_options.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ def ocrd_cli_options(f):
def cli(mets_url):
print(mets_url)
"""
# XXX Note that the `--help` output is statically generated in generate_processor_help
params = [
option('-m', '--mets', help="METS to process", default="mets.xml"),
option('-w', '--working-dir', help="Working Directory"),
Expand All @@ -25,6 +26,8 @@ def cli(mets_url):
option('-O', '--output-file-grp', help='File group(s) used as output.', default='OUTPUT'),
option('-g', '--page-id', help="ID(s) of the pages to process"),
option('--overwrite', help="Overwrite the output file group or a page range (--page-id)", is_flag=True, default=False),
option('-C', '--show-resource', help='Dump the content of processor resource RESNAME', metavar='RESNAME'),
option('-L', '--list-resources', is_flag=True, default=False, help='List names of processor resources'),
parameter_option,
parameter_override_option,
option('-J', '--dump-json', help="Dump tool description as JSON and exit", is_flag=True, default=False),
Expand Down
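
The two new options, `-C/--show-resource RESNAME` and `-L/--list-resources`, behave like the existing informational flags: when either is given, the wrapper instantiates the processor without a workspace and exits. The sketch below reproduces only the flag shapes with the standard library's `argparse` so it runs without dependencies; the PR itself declares them with click. The names `parse_resource_flags`, `is_info_only` and `ocrd-example` are assumptions for illustration.

```python
import argparse


def parse_resource_flags(argv):
    """Parse only the two resource flags added in this PR (argparse stand-in for click)."""
    parser = argparse.ArgumentParser(prog='ocrd-example')  # hypothetical processor name
    parser.add_argument('-C', '--show-resource', metavar='RESNAME',
                        help='Dump the content of processor resource RESNAME')
    parser.add_argument('-L', '--list-resources', action='store_true',
                        help='List names of processor resources')
    return parser.parse_args(argv)


def is_info_only(args):
    """True if the invocation should print information and exit without processing."""
    return bool(args.show_resource or args.list_resources)


args = parse_resource_flags(['-C', 'model.bin'])
print(args.show_resource, args.list_resources, is_info_only(args))
```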
79 changes: 76 additions & 3 deletions ocrd/ocrd/processor/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,34 @@
Processor base class and helper functions
"""

__all__ = ['Processor', 'generate_processor_help', 'run_cli', 'run_processor']
__all__ = [
'Processor',
'generate_processor_help',
'run_cli',
'run_processor'
]

import os
from os import makedirs
from os.path import exists, isdir, join
from shutil import copyfileobj
import json
from ocrd_utils import VERSION as OCRD_VERSION, MIMETYPE_PAGE, getLogger
import os
import re
import sys

import requests

from ocrd_utils import (
VERSION as OCRD_VERSION,
MIMETYPE_PAGE,
getLogger,
initLogging,
list_resource_candidates,
list_all_resources,
)
from ocrd_validators import ParameterValidator
from ocrd_models.ocrd_page import MetadataItemType, LabelType, LabelsType
from ..resource_manager import OcrdResourceManager

# XXX imports must remain for backwards-compatibility
from .helpers import run_cli, run_processor, generate_processor_help # pylint: disable=unused-import
Expand All @@ -33,6 +54,8 @@ def __init__(
input_file_grp="INPUT",
output_file_grp="OUTPUT",
page_id=None,
show_resource=None,
list_resources=False,
show_help=False,
show_version=False,
dump_json=False,
Expand All @@ -43,6 +66,20 @@ def __init__(
if dump_json:
print(json.dumps(ocrd_tool, indent=True))
return
if list_resources:
for res in list_all_resources(ocrd_tool['executable']):
print(res)
return
if show_resource:
res_fname = list_resource_candidates(ocrd_tool['executable'], show_resource, is_file=True)
if not res_fname:
initLogging()
logger = getLogger('ocrd.%s.__init__' % ocrd_tool['executable'])
logger.error("Failed to resolve %s for processor %s" % (show_resource, ocrd_tool['executable']))
else:
with open(res_fname[0], 'rb') as f:
copyfileobj(f, sys.stdout.buffer)
return
self.ocrd_tool = ocrd_tool
if show_help:
self.show_help()
Expand Down Expand Up @@ -84,6 +121,7 @@ def process(self):
"""
raise Exception("Must be implemented")


def add_metadata(self, pcgts):
"""
Adds PAGE-XML MetadataItem describing the processing step
Expand All @@ -107,6 +145,41 @@ def add_metadata(self, pcgts):
value=OCRD_VERSION)])
]))

def resolve_resource(self, val):
"""
Resolve a resource name to an absolute file path with the algorithm in
https://ocr-d.de/en/spec/ocrd_tool#file-parameters

Args:
val (string): resource value to resolve
"""
executable = self.ocrd_tool['executable']
if exists(val):
return val
ret = [cand for cand in list_resource_candidates(executable, val) if exists(cand)]
if ret:
return ret[0]
resmgr = OcrdResourceManager()
reslist = resmgr.find_resources(executable, name=val)
if not reslist:
reslist = resmgr.find_resources(executable, url=val)
if not reslist:
raise FileNotFoundError("Could not resolve %s resource '%s'" % (executable, val))
_, resdict = reslist[0]
return str(resmgr.download(
executable,
url=resdict['url'],
name=resdict['name'],
path_in_archive=resdict.get('path_in_archive', '.'),
resource_type=resdict['type']
))

def list_all_resources(self):
"""
List all resources found in the filesystem
"""
return list_all_resources(self.ocrd_tool['executable'])

@property
def input_files(self):
"""
Expand Down
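
The lookup order in `resolve_resource` above is: an existing path wins, then the first existing filesystem candidate, then a registry match by name, then a registry match by URL, and otherwise `FileNotFoundError`. The sketch below restates that order with plain data structures so it can be run without OCR-D installed. The function `resolve_in_order`, the list of candidate paths and the name-to-URL dictionary are stand-ins invented for this example; they replace `list_resource_candidates` and `OcrdResourceManager`, whose real behaviour is not reproduced here.

```python
from os.path import exists


def resolve_in_order(val, candidates, registry):
    """Return ('path', p) for a local hit or ('download', url) for a registry hit.

    candidates: list of filesystem paths to probe (stand-in for list_resource_candidates)
    registry:   dict mapping resource name to URL (stand-in for the resource manager)
    """
    # 1. The value itself is already an existing path
    if exists(val):
        return ('path', val)
    # 2. First candidate location that exists on disk
    found = [cand for cand in candidates if exists(cand)]
    if found:
        return ('path', found[0])
    # 3. Registry lookup by name, then by URL
    if val in registry:
        return ('download', registry[val])
    if val in registry.values():
        return ('download', val)
    raise FileNotFoundError("Could not resolve resource '%s'" % val)


# Hypothetical registry entry, used only to exercise the function
registry = {'example-model': 'https://example.org/example-model.bin'}
print(resolve_in_order('example-model', [], registry))
```

Returning a tag instead of downloading keeps the sketch free of network access; in the PR the registry branch calls `resmgr.download(...)` and returns the resulting path as a string.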