
Commit

Merge pull request #196 from OCR-D/update-models-2.22.0
Update models docs to core 2.22.0
kba authored Jan 26, 2021
2 parents 7ea67e8 + f1ba884 commit 3d25337
284 changes: 210 additions & 74 deletions site/en/models.md
its own internal format(s) for models. Some support central storage of models
at a specific location (tesseract, ocropy, kraken) while others require the full
path to a model (calamari).

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
comes with a framework for managing processor resources uniformly. This means
that processors can delegate to OCR-D/core to resolve specific file resources by name,
looking in well-defined places in the filesystem. This also includes downloading and caching
file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
of known resources, such as models, dictionaries, configurations and other
processor-specific data files. This means that OCR-D users should be able to
concentrate on fine-tuning their OCR workflows and not bother with implementation
details like "where do I get models from and where do I put them".
In particular, users can now reference file parameters simply by name.

All of the functionality mentioned above can be accessed using the `ocrd resmgr` command line tool.

## What models are available?

To get a list of the resources that OCR-D/core [is aware
of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):

```
ocrd resmgr list-available
```

The output will look similar to this:

```
ocrd-calamari-recognize
- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
Calamari model trained with GT4HistOCR
- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)
Calamari model trained with GT4HistOCR
ocrd-cis-ocropy-recognize
- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
ocropy historical latin model by github.com/chreul
```

As you can see, resources are grouped by the processors which make use of them.

The word after the list symbol, e.g. `qurator-gt4hist-0.3` or
`LatinHist.pyrnn.gz`, is the _name_ of the resource, a shorthand you can use in
parameters without having to specify the full URL (shown in parentheses after
the name).

The second line of each entry contains a short description of the resource.
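
For example, the following two invocations are equivalent (a sketch, borrowing the fileGrp names used in the examples further down); a file parameter given as a URL is downloaded and cached on first use:

```sh
# reference the resource by its short name ...
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model LatinHist.pyrnn.gz
# ... or by the full URL listed after it
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```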

## Installing known resources

You can install resources with the `ocrd resmgr download` command. It expects
the name of the processor as the first argument and either the name or URL of a
resource as the second argument.

For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:

```
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
# or
ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```

This will look up the resource in the [bundled resource and user databases](#user-database), download,
unarchive (where applicable) and store it in the [proper location](#where-is-the-data).
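
To verify that the resource actually arrived, one quick check (a sketch; any text filter works) is to grep the installed listing for the resource name:

```sh
ocrd resmgr list-installed | grep -A1 LatinHist.pyrnn.gz
```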

**NOTE:** The special name `*` can be used instead of a resource name/URL to
download *all* known resources for this processor. To download all Tesseract models:

```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

**NOTE:** Similarly, the special name `*` can be used instead of both the processor
and the resource to download *all* known resources for *all* installed processors:

```sh
ocrd resmgr download '*'
```

(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)

## Installing unknown resources

If you need to install a resource which OCR-D does not know of, you can do so by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.

For example, to install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`:

```
ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata
```

This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the
[user database](#user-database). You can then reference the resource by name as the value of the `model` parameter:

```
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model mymodel
```

## List installed resources

The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`, but instead
of consulting the database, it scans the filesystem locations [where data is searched](#where-is-the-data) for existing
resources and lists their URL and description if a database entry exists.

## User database

Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install
a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at
`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist.

This allows you to use the OCR-D/core resource manager mechanics, including
lookup of known resources by name or URL, without relying (only) on the
database maintained by the OCR-D/core developers.

**NOTE:** If you have produced or found resources that are of interest to the wider
OCR(-D) community, please tell us in the [OCR-D gitter
chat](https://gitter.im/OCR-D/Lobby) so we can add them to the database.

## Where is the data

The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters).

In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:

* `$PWD/ocrd-resources/ocrd-foo/<name>`
* `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/<name>`
* `/usr/local/share/ocrd-resources/ocrd-foo/<name>`

(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset).

We recommend using the `$XDG_DATA_HOME` location, which is also the default, but
you can override where data is stored with the `--location` option, which accepts
`cwd`, `data` or `system`, corresponding to the three locations above:

```sh
# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
# will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth
```

## Changing the default resource directory

The `$XDG_DATA_HOME` default location is reasonable because
models are usually large files which should persist across different deployments,
both native and containerized, both single-module and [ocrd_all](https://github.com/OCR-D/ocrd_all).
Moreover, that variable can easily be overridden during installation.

However, there are use cases where `system` or even `cwd` should be
used as location to store resources, hence the `--location` option.
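
For example, a minimal sketch of redirecting the default `data` location by overriding `XDG_DATA_HOME` before downloading (the path is purely illustrative):

```sh
# resources will then be stored under /mnt/ocrd-models/ocrd-resources/<processor>/
export XDG_DATA_HOME=/mnt/ocrd-models
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
```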

## Notes on specific processors

## Ocropy / ocrd_cis

An Ocropy model is simply the neural network serialized with Python's pickle
mechanism. It is generally distributed in gzipped form, with a `.pyrnn.gz`
extension, and can be used as such; there is no need to unarchive it.

To use a specific model with OCR-D's ocropus wrapper in
[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the
`ocrd-cis-ocropy-recognize` processor, use the `model` parameter:

```sh
# Model will be downloaded on-demand if it is not locally available yet
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz
```

## Calamari / ocrd_calamari

Calamari models are Tensorflow model directories. For distribution, this
directory is usually packed to a tarball or ZIP file. Once downloaded, these
containers must be unpacked to a directory again. `ocrd resmgr` handles this
for you, so you just need the name of the resource in the database.
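
For example, to fetch the `qurator-gt4hist-1.0` model listed earlier, downloading and unpacking in one step:

```sh
ocrd resmgr download ocrd-calamari-recognize qurator-gt4hist-1.0
```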

The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models).

To use a specific model with OCR-D's calamari wrapper
[ocrd_calamari](https://github.com/OCR-D/ocrd_calamari) and more specifically,
the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter:

```sh
# To use the "default" model, i.e. the one trained on GT4HistOCR by QURATOR
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
# To use your own trained model
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
# or, to be able to control which checkpoints to use:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
```

## Tesseract / ocrd_tesserocr

Tesseract models are single files with a `.traineddata` extension.

Since Tesseract only supports model lookup in a single directory, models should
only be stored in a single location. If the default location (`virtualenv`) is
not the place you want to use for Tesseract models, consider [changing the
default resource directory](#changing-the-default-resource-directory).

**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors
reuse the resource directory of `ocrd-tesserocr-recognize`.

If the `TESSDATA_PREFIX` environment variable is set when any of the Tesseract processors
are called, resources will be looked up there instead of in the default location.
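
A minimal sketch of such an override (the path is illustrative; depending on your Tesseract version, `TESSDATA_PREFIX` must point at the directory containing the `.traineddata` files or at its parent):

```sh
export TESSDATA_PREFIX=/path/to/tessdata
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model deu
```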

OCR-D's Tesseract wrapper,
[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more
specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the
model(s) to be provided as the `model` parameter. Multiple models can be
combined by concatenating with `+` (which generally improves accuracy but always slows processing):

```sh
# Use the deu and frk models
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deu+frk'
# Use the Fraktur model
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model Fraktur
```

# Models and Docker

We recommend keeping all downloaded resources in a persistent host directory,
separate from the `ocrd/*` Docker container and data directory, and mounting that
resource directory into a specific path in the container alongside the data directory.
The host resource directory can be empty initially. Each time you run the Docker container,
your processors will access the host directory to resolve resources, and you can download
additional models into that location using `ocrd resmgr`.

The following will assume (without loss of generality) that your host-side data
path is under `./data`, and the host-side resource path is under `./models`:

- To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in Docker:

  ```sh
  docker run --user $(id -u) \
      --volume $PWD/models:/usr/local/share/ocrd-resources \
      ocrd/all \
      ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
      ocrd resmgr download ocrd-calamari-recognize default\; \
      ...
  ```

- To run processors, as usual do:

  ```sh
  docker run --user $(id -u) --workdir /data \
      --volume $PWD/data:/data \
      --volume $PWD/models:/usr/local/share/ocrd-resources \
      ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
  ```

This principle applies to all `ocrd/*` Docker images, e.g. you can replace `ocrd/all` above with `ocrd/tesserocr` as well.
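
For instance, the recognition call above with the single-module image looks the same, only the image name changes:

```sh
docker run --user $(id -u) --workdir /data \
    --volume $PWD/data:/data \
    --volume $PWD/models:/usr/local/share/ocrd-resources \
    ocrd/tesserocr ocrd-tesserocr-recognize -I IN -O OUT -P model eng
```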

# Model training

With the pretrained models mentioned above, good results can be obtained for many originals. Nevertheless, the

