
Commit

Merge pull request #196 from OCR-D/update-models-2.22.0
Update models docs to core 2.22.0
kba authored Jan 26, 2021
2 parents 7ea67e8 + f1ba884 commit 3d25337
284 changes: 210 additions & 74 deletions site/en/models.md
its own internal format(s) for models. Some support central storage of models
at a specific location (tesseract, ocropy, kraken) while others require the full
path to a model (calamari).

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
comes with a framework for managing processor resources uniformly. This means
that processors can delegate to OCR-D/core to resolve specific file resources by name,
looking in well-defined places in the filesystem. This also includes downloading and caching
file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
of known resources, such as models, dictionaries, configurations and other
processor-specific data files. This means that OCR-D users should be able to
concentrate on fine-tuning their OCR workflows and not bother with implementation
details like "where do I get models from and where do I put them".
In particular, users can now reference file parameters simply by name.

All of the functionality mentioned above can be accessed using the `ocrd resmgr` command line tool.

## What models are available?

To get a list of the resources that OCR-D/core [is aware
of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):

```
ocrd resmgr list-available
```

The output will look similar to this:

```
ocrd-calamari-recognize
- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
Calamari model trained with GT4HistOCR
- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)
Calamari model trained with GT4HistOCR
ocrd-cis-ocropy-recognize
- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
ocropy historical latin model by github.com/chreul
```

As you can see, resources are grouped by the processors which make use of them.

The word after the list symbol, e.g. `qurator-gt4hist-0.3` or
`LatinHist.pyrnn.gz`, is the _name_ of the resource, a shorthand you can use in
parameters without having to specify the full URL (shown in parentheses after
the name).

The second line of each entry contains a short description of the resource.
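
For example, the following two invocations are equivalent (a sketch, borrowing the fileGrp names used in the examples further down); a file parameter given as a URL is downloaded and cached on first use:

```sh
# reference the resource by its short name ...
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model LatinHist.pyrnn.gz
# ... or by the full URL listed after it
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```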

## Installing known resources

You can install resources with the `ocrd resmgr download` command. It expects
the name of the processor as the first argument and either the name or URL of a
resource as the second argument.

For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:

```
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
# or
ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```

This will look up the resource in the [bundled resource and user databases](#user-database), download,
unarchive (where applicable) and store it in the [proper location](#where-is-the-data).
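
To verify that the resource actually arrived, one quick check (a sketch; any text filter works) is to grep the installed listing for the resource name:

```sh
ocrd resmgr list-installed | grep -A1 LatinHist.pyrnn.gz
```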

**NOTE:** The special name `*` can be used instead of a resource name/URL to
download *all* known resources for this processor. To download all Tesseract models:

```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

**NOTE:** Similarly, the special name `*` can be used instead of both the processor
and the resource to download *all* known resources for *all* installed processors:

```sh
ocrd resmgr download '*'
```

(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)

## Installing unknown resources

If you need to install a resource which OCR-D does not know of, you can do so by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.

For example, to install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`:

```
ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata
```

This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the
[user database](#user-database). You can then reference the resource by name as the value of the `model` parameter:

```
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model mymodel
```

## List installed resources

The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`, but instead
of consulting the database, it scans the filesystem locations [where data is searched](#where-is-the-data) for existing
resources and lists their URL and description if a database entry exists.

## User database

Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install
a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at
`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist.

This allows you to use the OCR-D/core resource manager mechanics, including
lookup of known resources by name or URL, without relying (only) on the
database maintained by the OCR-D/core developers.

**NOTE:** If you have produced or found resources that are of interest to the wider
OCR(-D) community, please tell us in the [OCR-D gitter
chat](https://gitter.im/OCR-D/Lobby) so we can add them to the database.

## Where is the data

The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters).

In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:

* `$PWD/ocrd-resources/ocrd-foo/<name>`
* `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/<name>`
* `/usr/local/share/ocrd-resources/ocrd-foo/<name>`

(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset).

We recommend using the `$XDG_DATA_HOME` location, which is also the default, but
you can override where data is stored with the `--location` option, which accepts
`cwd`, `data` or `system`, corresponding to the three locations above:

```sh
# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
# will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth
```

## Changing the default resource directory

The `$XDG_DATA_HOME` default location is reasonable because
models are usually large files which should persist across different deployments,
both native and containerized, both single-module and [ocrd_all](https://github.com/OCR-D/ocrd_all).
Moreover, that variable can easily be overridden during installation.

However, there are use cases where `system` or even `cwd` should be
used as location to store resources, hence the `--location` option.
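
For example, a minimal sketch of redirecting the default `data` location by overriding `XDG_DATA_HOME` before downloading (the path is purely illustrative):

```sh
# resources will then be stored under /mnt/ocrd-models/ocrd-resources/<processor>/
export XDG_DATA_HOME=/mnt/ocrd-models
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
```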

## Notes on specific processors

## Ocropy / ocrd_cis

An Ocropy model is simply the neural network serialized with Python's pickle
mechanism. It is generally distributed in gzipped form, with a `.pyrnn.gz`
extension, and can be used as such; there is no need to unarchive it.

To use a specific model with OCR-D's ocropus wrapper in
[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the
`ocrd-cis-ocropy-recognize` processor, use the `model` parameter:

```sh
# Model will be downloaded on-demand if it is not locally available yet
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz
```

## Calamari / ocrd_calamari

Calamari models are Tensorflow model directories. For distribution, this
directory is usually packed to a tarball or ZIP file. Once downloaded, these
containers must be unpacked to a directory again. `ocrd resmgr` handles this
for you, so you just need the name of the resource in the database.
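
For example, to fetch the `qurator-gt4hist-1.0` model listed earlier, downloading and unpacking in one step:

```sh
ocrd resmgr download ocrd-calamari-recognize qurator-gt4hist-1.0
```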

The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models).

To use a specific model with OCR-D's calamari wrapper
[ocrd_calamari](https://github.com/OCR-D/ocrd_calamari) and more specifically,
the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter:

```sh
# To use the "default" model, i.e. the one trained on GT4HistOCR by QURATOR
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
# To use your own trained model
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
# or, to be able to control which checkpoints to use:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
```

## Tesseract / ocrd_tesserocr

Tesseract models are single files with a `.traineddata` extension.

Since Tesseract only supports model lookup in a single directory, models should
only be stored in a single location. If the default location (`virtualenv`) is
not the place you want to use for Tesseract models, consider [changing the
default resource directory](#changing-the-default-resource-directory).

**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors
reuse the resource directory of `ocrd-tesserocr-recognize`.

If the `TESSDATA_PREFIX` environment variable is set when any of the Tesseract processors
are called, resources will be looked up there instead of in the default location.
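
A minimal sketch of such an override (the path is illustrative; depending on your Tesseract version, `TESSDATA_PREFIX` must point at the directory containing the `.traineddata` files or at its parent):

```sh
export TESSDATA_PREFIX=/path/to/tessdata
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model deu
```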

OCR-D's Tesseract wrapper,
[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more
specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the
model(s) to be provided as the `model` parameter. Multiple models can be
combined by concatenating with `+` (which generally improves accuracy but always slows processing):

```sh
# Use the deu and frk models
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deu+frk'
# Use the Fraktur model
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model Fraktur
```

# Models and Docker

We recommend keeping all downloaded resources in a persistent host directory,
separate from the `ocrd/*` Docker container and data directory, and mounting that
resource directory into a specific path in the container alongside the data directory.
The host resource directory can be empty initially. Each time you run the Docker container,
your processors will access the host directory to resolve resources, and you can download
additional models into that location using `ocrd resmgr`.

The following will assume (without loss of generality) that your host-side data
path is under `./data`, and the host-side resource path is under `./models`:

- To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in Docker:

  ```sh
  docker run --user $(id -u) \
      --volume $PWD/models:/usr/local/share/ocrd-resources \
      ocrd/all \
      ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
      ocrd resmgr download ocrd-calamari-recognize default\; \
      ...
  ```

- To run processors, as usual do:

  ```sh
  docker run --user $(id -u) --workdir /data \
      --volume $PWD/data:/data \
      --volume $PWD/models:/usr/local/share/ocrd-resources \
      ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
  ```

This principle applies to all `ocrd/*` Docker images, e.g. you can replace `ocrd/all` above with `ocrd/tesserocr` as well.
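
For instance, the recognition call above with the single-module image looks the same, only the image name changes:

```sh
docker run --user $(id -u) --workdir /data \
    --volume $PWD/data:/data \
    --volume $PWD/models:/usr/local/share/ocrd-resources \
    ocrd/tesserocr ocrd-tesserocr-recognize -I IN -O OUT -P model eng
```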

# Model training

With the pretrained models mentioned above, good results can be obtained for many originals. Nevertheless, the

