Update models docs to core 2.22.0 #196

Merged 21 commits into from Jan 26, 2021

281 changes: 207 additions & 74 deletions site/en/models.md
its own internal format(s) for models. Some support central storage of models
at a specific location (tesseract, ocropy, kraken) while others require the full
path to a model (calamari).

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
comes with a framework for managing processor resources uniformly. This means
that processors can delegate to OCR-D/core to resolve specific file resources by name,
looking in well-defined places in the filesystem. This also includes downloading and caching
file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
of known resources, such as models, dictionaries, configurations and other
processor-specific data files. This means that OCR-D users can concentrate
on fine-tuning their OCR workflows without bothering with implementation
details like "where do I get models from and where do I put them?".
In particular, users can now reference file parameters by name.

All of the above mentioned functionality can be accessed using the `ocrd
resmgr` command line tool.

## What models are available?

To get a list of the resources that OCR-D/core [is aware
of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):

```
ocrd resmgr list-available
```

The output will look similar to this:

```

ocrd-calamari-recognize
- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
Calamari model trained with GT4HistOCR
- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)
Calamari model trained with GT4HistOCR

ocrd-cis-ocropy-recognize
- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
ocropy historical latin model by github.com/chreul
```

As you can see, resources are grouped by the processors which make use of them.

The word after the list symbol, e.g. `qurator-gt4hist-0.3` or
`LatinHist.pyrnn.gz`, is the _name_ of the resource, a shorthand you can
use in parameters without having to specify the full URL (shown in brackets
after the name).

The second line of each entry contains a short description of the resource.

## Installing known resources

You can install resources with the `ocrd resmgr download` command. It expects
the name of the processor as the first argument and either the name or URL of a
resource as a second argument.

For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:

```
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
# or
ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```

This will look up the resource in the [bundled resource and user databases](#user-database), download,
unarchive (where applicable) and store it in the [proper location](#where-is-the-data).
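
Under the hood, this amounts to fetching the file and unpacking archives into
the processor's resource directory. A hedged shell sketch of that step (the
helper name, default path and logic are illustrative, not the actual
OCR-D/core implementation):

```shell
# Sketch of what `ocrd resmgr download` does: fetch a resource, unpack
# archives, and place the result under the processor's resource directory.
# Helper name and default base path are illustrative only.
install_resource() {
    src="$1"                                      # URL or local path
    processor="$2"                                # e.g. ocrd-cis-ocropy-recognize
    base="${3:-$HOME/.local/share/ocrd-resources}"
    dest="$base/$processor"
    mkdir -p "$dest"
    case "$src" in
        http://*|https://*) curl -sSL "$src" -o "$dest/$(basename "$src")" ;;
        *)                  cp "$src" "$dest/" ;;
    esac
    file="$dest/$(basename "$src")"
    case "$file" in
        # tarballs are unpacked in place; single files are kept as-is
        *.tar.xz|*.tar.gz) tar -xf "$file" -C "$dest" && rm "$file" ;;
    esac
}
```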

**NOTE:** The special name `*` can be used instead of a resource name/URL to
download *all* known resources for a processor. To download all Tesseract models:

```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

(Note that `*` must be in quotes or escaped to avoid wildcard expansion in the shell.)

## Installing unknown resources

If you need to install a resource that OCR-D doesn't know about, you can do so
by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.

For example, to install a model for `ocrd-tesserocr-recognize` located at
`https://my-server/mymodel.traineddata`:

```
ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata
```

This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the
[user database](#user-database). You can then use it as the parameter value for the `model` parameter:

```
ocrd-tesserocr-recognize -P model mymodel
```

## List installed resources

The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`.
But instead of querying the database, it scans the filesystem locations [where data is
searched](#where-is-the-data) for existing resources, listing URL and description where a
database entry exists.
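
A rough shell sketch of that scan (the function and directory enumeration are
illustrative, not the actual implementation):

```shell
# Walk the candidate resource locations for a processor and print the
# names of whatever is installed on disk. Illustrative sketch only.
list_installed() {
    processor="$1"
    for base in "$VIRTUAL_ENV/share/ocrd-resources" \
                "$HOME/.config/ocrd-resources" \
                "$HOME/.local/share/ocrd-resources" \
                "$HOME/.cache/ocrd-resources" \
                "$PWD/ocrd-resources"; do
        [ -d "$base/$processor" ] || continue
        for res in "$base/$processor"/*; do
            [ -e "$res" ] && basename "$res"
        done
    done
}
```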

## User database

Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem,
or whenever you install a resource with `ocrd resmgr download`, it will create a new stub
entry in the user database, which is located at `$HOME/.config/ocrd/resources.yml`
(created if it doesn't exist).

This allows you to use the OCR-D/core resource manager mechanics, including
lookup of known resources by name or URL, without relying (only) on the
database maintained by the OCR-D/core developers.
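
For illustration, a stub entry in `$HOME/.config/ocrd/resources.yml` might look
roughly like the following; the exact field names are an assumption on our
part, not a guaranteed schema:

```yaml
# hypothetical stub entry created after `ocrd resmgr download -n ...`
ocrd-tesserocr-recognize:
  - name: mymodel.traineddata
    url: https://my-server/mymodel.traineddata
    description: manually downloaded model (stub entry)
```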

**NOTE:** If you produced or found resources that are interesting for the wider
OCR(-D) community, please tell us in the [OCR-D gitter
chat](https://gitter.im/OCR-D/Lobby) so we can add them to the database.

## Where is the data

The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters).

In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:

* `$VIRTUAL_ENV/share/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.config/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.local/share/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.cache/ocrd-resources/ocrd-foo/<name>`
* `$PWD/ocrd-resources/ocrd-foo/<name>`
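
The precedence amounts to a first-match search over these locations. A minimal
shell sketch (illustrative only, not the actual OCR-D/core code):

```shell
# Return the first existing candidate path for a processor resource,
# following the preference order listed above. Illustrative sketch only.
resolve_resource() {
    processor="$1"; name="$2"
    for base in "$VIRTUAL_ENV/share/ocrd-resources" \
                "$HOME/.config/ocrd-resources" \
                "$HOME/.local/share/ocrd-resources" \
                "$HOME/.cache/ocrd-resources" \
                "$PWD/ocrd-resources"; do
        candidate="$base/$processor/$name"
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1  # resource not found in any location
}
```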

We recommend using the `$VIRTUAL_ENV` location, which is also the default. But
you can override the location to store data with the `--location` option, which
accepts one of `cwd`, `virtualenv`, `config`, `data` or `cache`:

```sh
# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
# will download to $HOME/.cache/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cache ocrd-anybaseocr-dewarp latest_net_G.pth
```

## Changing the default resource directory

The `$VIRTUAL_ENV` location is a reasonable default because we strongly
recommend using virtual environments, and it is compatible with
[ocrd_all](https://github.com/OCR-D/ocrd_all).

However, there are use cases where the `config`, `data` or `cache` option, or
even `cwd`, should be the default (or only) location to store resources and
resolve file parameters.

To change the default location, adapt the `$HOME/.config/ocrd/config.yml` file
(it is created whenever you execute `ocrd resmgr` if it doesn't exist yet),
which has a `resource_location` key that accepts the same range of values as
the `ocrd resmgr --location` command line flag.
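
For example, to make the user data directory (`$HOME/.local/share/ocrd-resources`)
the default, the file would contain something like the following sketch (we
assume a plain top-level key, as described above):

```yaml
# $HOME/.config/ocrd/config.yml
resource_location: data
```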

## Notes on specific processors

## Ocropy / ocrd_cis

An Ocropy model is simply the neural network serialized with Python's pickle
mechanism. It is generally distributed in gzipped form, with a `.pyrnn.gz`
extension, and can be used as such; there is no need to unarchive it.

To use a specific model with OCR-D's ocropus wrapper in
[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis), and more specifically the
`ocrd-cis-ocropy-recognize` processor, use the `model` parameter:

```sh
# Model will be downloaded on-demand if it is not locally available yet
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz
```

## Calamari / ocrd_calamari

Calamari models are Tensorflow model directories. For distribution, this
directory is usually packed to a tarball or ZIP file. Once downloaded, these
containers must be unpacked to a directory again. `ocrd resmgr` handles this
for you, so you just need the name of the resource in the database.

The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models).

To use a specific model with OCR-D's calamari wrapper
[ocrd_calamari](https://github.com/OCR-D/ocrd_calamari), and more specifically
the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter:

```sh
# To use the "default" model, i.e. the one trained on GT4HistOCR by QURATOR
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
# To use your own trained model
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
# or, to be able to control which checkpoints to use:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
```

## Tesseract / ocrd_tesserocr

Tesseract models are single files with a `.traineddata` extension.

Since Tesseract only supports model lookup in a single directory, models should
be stored in a single location. If the default location (`virtualenv`) is
not where you want to keep your Tesseract models, consider [changing the default location
in the OCR-D config file](#changing-the-default-resource-directory).

**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors
reuse the resource directory of `ocrd-tesserocr-recognize`.

If the `TESSDATA_PREFIX` environment variable is set when any of the Tesseract processors
is called, it is used as the location to look for resources instead of the default.
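
For example, to point all Tesseract-based processors at a custom model
directory (the path is illustrative; with Tesseract 4.x, `TESSDATA_PREFIX`
names the model directory itself):

```shell
# Make Tesseract look for *.traineddata files in $HOME/tessdata.
# The path is an example; any directory containing models works.
export TESSDATA_PREFIX="$HOME/tessdata"
mkdir -p "$TESSDATA_PREFIX"
```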

OCR-D's Tesseract wrapper
[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr), and more
specifically the `ocrd-tesserocr-recognize` processor, expects the name of the
model(s) to be provided as the `model` parameter. Multiple models can be
combined by concatenating with `+` (which generally improves accuracy but always slows processing):

```sh
# Use the deu and frk models
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model deu+frk
# Use the Fraktur model
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model Fraktur
```

# Models and Docker

We recommend a two-step process to make models available in Docker. First,
download all the models that you want to use on the host system. Then, when
running the Docker container, mount that local directory into the container
alongside the data you want to process.

Download the models to `$HOME/.local/share/ocrd-resources`:

```sh
ocrd resmgr download --location data ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download --location data ocrd-calamari-recognize default
# ...
```

Run the `ocrd_all` Docker container:

```sh
docker run --user $(id -u) --workdir /data \
--volume $PWD:/data \
--volume $HOME/.local/share/ocrd-resources:/ocrd-resources \
ocrd_all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
```


# Model training

With the pretrained models mentioned above, good results can be obtained for many originals. Nevertheless, the