Update models docs to core 2.22.0 #196

Merged 21 commits into from Jan 26, 2021

281 changes: 207 additions & 74 deletions site/en/models.md
its own internal format(s) for models. Some support central storage of models
at a specific location (tesseract, ocropy, kraken) while others require the full
path to a model (calamari).

Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
comes with a framework for managing processor resources uniformly. This means
that processors can delegate to OCR-D/core to resolve specific file resources by name,
looking in well-defined places in the filesystem. This also includes downloading and caching
file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
of known resources, such as models, dictionaries, configurations and other
processor-specific data files. This means that OCR-D users can concentrate
on fine-tuning their OCR workflows without bothering with implementation
details like "where do I get models from and where do I put them?".
In particular, users can now reference file parameters by name.

All of the above mentioned functionality can be accessed using the `ocrd
resmgr` command line tool.

## What models are available?

To get a list of the resources that OCR-D/core [is aware
of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):

```
ocrd resmgr list-available
```

The output will look similar to this:

```

ocrd-calamari-recognize
- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
Calamari model trained with GT4HistOCR
- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)
Calamari model trained with GT4HistOCR

ocrd-cis-ocropy-recognize
- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
ocropy historical latin model by github.com/chreul
```

As you can see, resources are grouped by the processors which make use of them.

The word after the list symbol, e.g. `qurator-gt4hist-0.3` or
`LatinHist.pyrnn.gz`, is the _name_ of the resource, a shorthand you can
use in parameters without having to specify the full URL (shown in brackets
after the name).

The second line of each entry contains a short description of the resource.

## Installing known resources

You can install resources with the `ocrd resmgr download` command. It expects
the name of the processor as the first argument and either the name or URL of a
resource as a second argument.

For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:

```
ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz
# or
ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz
```

This will look up the resource in the [bundled resource and user databases](#user-database), download,
unarchive (where applicable) and store it in the [proper location](#where-is-the-data).
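
Under the hood, this amounts to fetching the file and unpacking archives into
the processor's resource directory. A hedged shell sketch of that step (the
helper name, default path and logic are illustrative, not the actual
OCR-D/core implementation):

```shell
# Sketch of what `ocrd resmgr download` does: fetch a resource, unpack
# archives, and place the result under the processor's resource directory.
# Helper name and default base path are illustrative only.
install_resource() {
    src="$1"                                      # URL or local path
    processor="$2"                                # e.g. ocrd-cis-ocropy-recognize
    base="${3:-$HOME/.local/share/ocrd-resources}"
    dest="$base/$processor"
    mkdir -p "$dest"
    case "$src" in
        http://*|https://*) curl -sSL "$src" -o "$dest/$(basename "$src")" ;;
        *)                  cp "$src" "$dest/" ;;
    esac
    file="$dest/$(basename "$src")"
    case "$file" in
        # tarballs are unpacked in place; single files are kept as-is
        *.tar.xz|*.tar.gz) tar -xf "$file" -C "$dest" && rm "$file" ;;
    esac
}
```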

**NOTE:** The special name `*` can be used instead of a resource name/URL to
download *all* known resources for a processor. To download all Tesseract models:

```sh
ocrd resmgr download ocrd-tesserocr-recognize '*'
```

(Note that `*` must be in quotes or escaped to avoid wildcard expansion in the shell.)

## Installing unknown resources

If you need to install a resource that OCR-D doesn't know about, you can do so
by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.

For example, to install a model for `ocrd-tesserocr-recognize` located at
`https://my-server/mymodel.traineddata`:

```
ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata
```

This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the
[user database](#user-database). You can then use it as the parameter value for the `model` parameter:

```
ocrd-tesserocr-recognize -P model mymodel
```

## List installed resources

The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`.
But instead of querying the database, it scans the filesystem locations [where data is
searched](#where-is-the-data) for existing resources, listing URL and description where a
database entry exists.
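
A rough shell sketch of that scan (the function and directory enumeration are
illustrative, not the actual implementation):

```shell
# Walk the candidate resource locations for a processor and print the
# names of whatever is installed on disk. Illustrative sketch only.
list_installed() {
    processor="$1"
    for base in "$VIRTUAL_ENV/share/ocrd-resources" \
                "$HOME/.config/ocrd-resources" \
                "$HOME/.local/share/ocrd-resources" \
                "$HOME/.cache/ocrd-resources" \
                "$PWD/ocrd-resources"; do
        [ -d "$base/$processor" ] || continue
        for res in "$base/$processor"/*; do
            [ -e "$res" ] && basename "$res"
        done
    done
}
```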

## User database

Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem,
or whenever you install a resource with `ocrd resmgr download`, it will create a new stub
entry in the user database, which is located at `$HOME/.config/ocrd/resources.yml`
(created if it doesn't exist).

This allows you to use the OCR-D/core resource manager mechanics, including
lookup of known resources by name or URL, without relying (only) on the
database maintained by the OCR-D/core developers.
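
For illustration, a stub entry in `$HOME/.config/ocrd/resources.yml` might look
roughly like the following; the exact field names are an assumption on our
part, not a guaranteed schema:

```yaml
# hypothetical stub entry created after `ocrd resmgr download -n ...`
ocrd-tesserocr-recognize:
  - name: mymodel.traineddata
    url: https://my-server/mymodel.traineddata
    description: manually downloaded model (stub entry)
```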

**NOTE:** If you produced or found resources that are interesting for the wider
OCR(-D) community, please tell us in the [OCR-D gitter
chat](https://gitter.im/OCR-D/Lobby) so we can add them to the database.

## Where is the data

The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters).

In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:

* `$VIRTUAL_ENV/share/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.config/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.local/share/ocrd-resources/ocrd-foo/<name>`
* `$HOME/.cache/ocrd-resources/ocrd-foo/<name>`
* `$PWD/ocrd-resources/ocrd-foo/<name>`
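
The precedence amounts to a first-match search over these locations. A minimal
shell sketch (illustrative only, not the actual OCR-D/core code):

```shell
# Return the first existing candidate path for a processor resource,
# following the preference order listed above. Illustrative sketch only.
resolve_resource() {
    processor="$1"; name="$2"
    for base in "$VIRTUAL_ENV/share/ocrd-resources" \
                "$HOME/.config/ocrd-resources" \
                "$HOME/.local/share/ocrd-resources" \
                "$HOME/.cache/ocrd-resources" \
                "$PWD/ocrd-resources"; do
        candidate="$base/$processor/$name"
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1  # resource not found in any location
}
```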

We recommend using the `$VIRTUAL_ENV` location, which is also the default. But
you can override the location to store data with the `--location` option, which
accepts one of `cwd`, `virtualenv`, `config`, `data` or `cache`:

```sh
# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
# will download to $HOME/.cache/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
ocrd resmgr download --location cache ocrd-anybaseocr-dewarp latest_net_G.pth
```

## Changing the default resource directory

The `$VIRTUAL_ENV` location is a reasonable default because we strongly
recommend using virtual environments, and it is compatible with
[ocrd_all](https://github.com/OCR-D/ocrd_all).

However, there are use cases where the `config`, `data` or `cache` option, or
even `cwd`, should be the default (or only) location to store resources and
resolve file parameters.

To change the default location, adapt the `$HOME/.config/ocrd/config.yml` file
(it is created whenever you execute `ocrd resmgr` if it doesn't exist yet),
which has a `resource_location` key that accepts the same range of values as
the `ocrd resmgr --location` command line flag.
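
For example, to make the user data directory (`$HOME/.local/share/ocrd-resources`)
the default, the file would contain something like the following sketch (we
assume a plain top-level key, as described above):

```yaml
# $HOME/.config/ocrd/config.yml
resource_location: data
```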

## Notes on specific processors

## Ocropy / ocrd_cis

An Ocropy model is simply the neural network serialized with Python's pickle
mechanism. It is generally distributed in gzipped form, with a `.pyrnn.gz`
extension, and can be used as such; there is no need to unarchive it.

To use a specific model with OCR-D's ocropus wrapper in
[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis), and more specifically the
`ocrd-cis-ocropy-recognize` processor, use the `model` parameter:

```sh
# Model will be downloaded on-demand if it is not locally available yet
ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz
```

## Calamari / ocrd_calamari

Calamari models are Tensorflow model directories. For distribution, this
directory is usually packed to a tarball or ZIP file. Once downloaded, these
containers must be unpacked to a directory again. `ocrd resmgr` handles this
for you, so you just need the name of the resource in the database.

The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models).

To use a specific model with OCR-D's calamari wrapper
[ocrd_calamari](https://github.com/OCR-D/ocrd_calamari), and more specifically
the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter:

```sh
# To use the "default" model, i.e. the one trained on GT4HistOCR by QURATOR
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
# To use your own trained model
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
# or, to be able to control which checkpoints to use:
ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
```

## Tesseract / ocrd_tesserocr

Tesseract models are single files with a `.traineddata` extension.

Since Tesseract only supports model lookup in a single directory, models should
be stored in a single location. If the default location (`virtualenv`) is
not where you want to keep your Tesseract models, consider [changing the default location
in the OCR-D config file](#changing-the-default-resource-directory).

**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors
reuse the resource directory of `ocrd-tesserocr-recognize`.

If the `TESSDATA_PREFIX` environment variable is set when any of the Tesseract processors
is called, it is used as the location to look for resources instead of the default.
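
For example, to point all Tesseract-based processors at a custom model
directory (the path is illustrative; with Tesseract 4.x, `TESSDATA_PREFIX`
names the model directory itself):

```shell
# Make Tesseract look for *.traineddata files in $HOME/tessdata.
# The path is an example; any directory containing models works.
export TESSDATA_PREFIX="$HOME/tessdata"
mkdir -p "$TESSDATA_PREFIX"
```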

OCR-D's Tesseract wrapper
[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr), and more
specifically the `ocrd-tesserocr-recognize` processor, expects the name of the
model(s) to be provided as the `model` parameter. Multiple models can be
combined by concatenating with `+` (which generally improves accuracy but always slows processing):

```sh
# Use the deu and frk models
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model deu+frk
# Use the Fraktur model
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model Fraktur
```

# Models and Docker

We recommend a two-step process to make models available in Docker. First,
download all the models that you want to use on the host system. Then, when
running the Docker container, mount that local directory into the container
alongside the data you want to process.

Download the models to `$HOME/.local/share/ocrd-resources`:

```sh
ocrd resmgr download --location data ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download --location data ocrd-calamari-recognize default
# ...
```

Run the `ocrd_all` Docker container:

```sh
docker run --user $(id -u) --workdir /data \
--volume $PWD:/data \
--volume $HOME/.local/share/ocrd-resources:/ocrd-resources \
ocrd_all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
```


# Model training

With the pretrained models mentioned above, good results can be obtained for many originals. Nevertheless, the