Intro Docker

Introduction to Docker in OCR-D

Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. (Wikipedia, "Docker")

Docker allows bundling an operating system with application software into containers that can be executed on a multitude of host operating systems, including Linux, Windows and macOS.

Since these containers are fully self-contained, the host OS only needs the Docker software or a Docker-compatible container tool like Podman. As such, Docker is the simplest and most consistent way to deploy ocrd_all.

This article introduces the most important concepts, shows how you can use Docker for OCR-D processing, lists some handy commands and tips to get started, and gives an outlook on how we will use Docker in the future.

Basics

Terminology

A Docker image is the "blueprint" from which instances, called containers, can be started and used. A Docker image is defined by a "Dockerfile", which specifies the base image to build on. In OCR-D this will be Ubuntu 18.04 for the foreseeable future.

Images can come in different variants, called tags, which in most cases differ in the software they include.

Docker containers are supposed to be ephemeral, i.e. Docker is optimized for quickly starting up a container, doing a task (such as OCR-D processing or starting a continuous server process) and then shutting the container down and removing it. But it is perfectly possible to treat a Docker container like a virtual machine and "log into a Docker container" to familiarize yourself with the OCR-D command line interface and the available processors.

We distinguish slim containers, which contain the processors of a single project, such as ocrd/calamari, and fat containers, which contain many or even all OCR-D projects, such as ocrd/all.

Is Docker installed?

First of all, make sure that Docker is installed. The following command should print the Docker engine version:

$ docker --version
Docker version 19.03.12, build 48a66213fe
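
Since a Docker-compatible tool such as Podman works as well, a script may want to check which of the two is available. The following is a minimal sketch; the function name find_container_tool is our own and not part of Docker or OCR-D:

```shell
#!/bin/sh
# Hypothetical helper: print the first container tool found on PATH,
# or fail with an error message if neither is installed.
find_container_tool() {
    for tool in docker podman; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    echo "neither docker nor podman found" >&2
    return 1
}

# Print the version of whichever tool was found.
if tool=$(find_container_tool); then
    "$tool" --version
fi
```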

`docker pull`

To retrieve an image from Docker Hub, the central repository of Docker images, you can use the docker pull command:

docker pull ocrd/all:minimum

Since updates to ocrd_all are released at short intervals, we recommend running docker pull for the OCR-D images regularly, possibly even in a cron job.
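
For a cron job, the pull can be wrapped in a small script. This is a minimal sketch and not an official OCR-D tool: the function name pull_ocrd_images and the DOCKER variable are our own invention, and the tags are only examples. By default the script just prints the docker commands; set DOCKER=docker to really run them.

```shell
#!/bin/sh
# DOCKER defaults to "echo docker", so commands are printed, not executed.
# Run with DOCKER=docker to actually pull the images.
DOCKER="${DOCKER:-echo docker}"

# Hypothetical helper: pull every given tag of the ocrd/all image.
pull_ocrd_images() {
    for tag in "$@"; do
        $DOCKER pull "ocrd/all:$tag" || return 1
    done
}

pull_ocrd_images minimum maximum
```

Saved as, say, /usr/local/bin/pull-ocrd-images.sh (a hypothetical path), the script could be called from a crontab entry such as `0 3 * * 0 DOCKER=docker /usr/local/bin/pull-ocrd-images.sh` to update the images every Sunday at 03:00.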

Basic `docker run`

The main docker command you will be dealing with is docker run. As the name suggests, this starts a container, pulling the corresponding image from Docker Hub first if it is not available locally.

In the simplest case, a docker run command looks like this:

docker run <image-name> [command]

For example, to get the help message of ocrd-olena-binarize:

docker run ocrd/all:minimum ocrd-olena-binarize --help

Advanced `docker run`

docker run supports a multitude of command line options. Here are the most important ones that you should consider setting:

--user / -u: Sets the user id/group of the user the container should be run as. In most cases, this should be your own user/group.
--rm: Removes the container after it has finished. Without this flag, stopped containers will pile up and eat up disk space.
--volume / --mount: These flags allow mounting data from the host system into the container. The --volume flag has a simpler syntax for the use case of mounting a host directory to a container directory (--volume /host/path:/container/path), whereas --mount is more powerful and versatile.
-i / -t: Mark the container run as interactive (-i) and connect a terminal (-t). These flags go together and are required for interactive sessions.

A reasonable call to binarize all images in a workspace in the current working directory with ocrd_olena (line breaks for readability):

docker run --rm --user "$(id -u):$(id -g)" \
    --volume "$PWD":/data ocrd/all:minimum \
    ocrd-olena-binarize -I FULL -O BIN -g page1
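
Since these flags are the same for almost every call, they can be wrapped in a shell function. The following is a minimal sketch under our own assumptions: ocrd_run, DOCKER and OCRD_IMAGE are hypothetical names, not something ocrd_all provides. As written, the command is only printed; set DOCKER=docker to execute it.

```shell
#!/bin/sh
# DOCKER defaults to "echo docker", so the command is printed, not executed.
DOCKER="${DOCKER:-echo docker}"
# Image to run; override OCRD_IMAGE to use another tag.
OCRD_IMAGE="${OCRD_IMAGE:-ocrd/all:minimum}"

# Hypothetical wrapper: run one OCR-D processor in a throwaway container
# as the current user, with the current directory mounted at /data.
ocrd_run() {
    $DOCKER run --rm \
        --user "$(id -u):$(id -g)" \
        --volume "$PWD:/data" \
        "$OCRD_IMAGE" "$@"
}

ocrd_run ocrd-olena-binarize -I FULL -O BIN -g page1
```

Everything after the function name is passed on to the container unchanged, so ocrd_run ocrd-olena-binarize --help works just as well.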

ocrd_all image tags

ocrd_all builds various docker image tags for every release, based on Ubuntu 18.04:

minimum/medium/maximum: These denote how many of the available processors in ocrd_all are included in the image. See the comparison table in ocrd_all for a rundown of the variants.
Tags containing -git retain the Git repositories of the projects, so you can update an existing container from within.
Tags containing -cuda are built using the NVIDIA CUDA Ubuntu 18.04 base image to allow for GPU processing inside OCR-D containers.

We recommend using the maximum or maximum-git tags unless you do not need all the included processors or disk space is an issue.

Tips and Tricks

Log into a container

The docker run command line syntax can be daunting and error-prone, especially when you're just starting out with OCR-D processors.

To make it easier to follow the user guide and workflow guide, you can "log into" a Docker container by running a shell in it and executing commands just like you would with a host-native installation:

$ docker run -it ocrd/all bash

Note, however, that while this is useful for getting started, you should not run Docker containers interactively in a production system.

Mounting data

The simplest way to mount data into a container is the --volume flag, which maps a host directory to a container directory. By convention, we use the /data path inside Docker containers for data processing, so for most use cases it should suffice to mount the workspace directory on the host to the /data directory in the container:

$ docker run --volume /path/to/workspace:/data [...]

You can, however, also mount individual files by using the --mount option, which is particularly useful for mounting models, configuration files etc. to a specific location in the container. For example, to mount a specific Tesseract model to the right location for ocrd_tesserocr:

docker run --mount type=bind,source=/path/to/model.traineddata,destination=/usr/share/tesseract-ocr/4.00/tessdata/model.traineddata [...]
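
Because the --mount argument is long and easy to mistype, it can be generated from the model path. This is only a sketch: model_mount_opt and TESSDATA_DIR are hypothetical names, the directory is the one from the example above, and adding readonly is our own choice on the assumption that the container never needs to write to the model file.

```shell
#!/bin/sh
# Model directory inside the container, taken from the example above.
TESSDATA_DIR=/usr/share/tesseract-ocr/4.00/tessdata

# Hypothetical helper: print a --mount option that bind-mounts one model
# file from the host, read-only, under the same file name in TESSDATA_DIR.
model_mount_opt() {
    model=$1
    printf '%s\n' "--mount type=bind,source=$model,destination=$TESSDATA_DIR/$(basename "$model"),readonly"
}

model_mount_opt /path/to/model.traineddata
```

The output can be spliced into a call with command substitution, e.g. docker run --rm $(model_mount_opt /path/to/model.traineddata) [...], as long as the path contains no spaces.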

Cleaning up

If you do not use the --rm flag consistently, you will end up with stopped containers and obsolete images over time.

The simplest way to clean up everything (dangling images, stopped containers, unused network bridges etc.) is the docker system prune command. It will ask for confirmation and show you how much disk space was reclaimed.
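
If you prefer a more selective cleanup, Docker also offers separate prune commands for containers and images. The following sketch chains them; docker_cleanup and the DOCKER variable are hypothetical names of ours, and --force skips the confirmation prompt, so use it with care. As written, the commands are only printed; set DOCKER=docker to execute them.

```shell
#!/bin/sh
# DOCKER defaults to "echo docker", so commands are printed, not executed.
DOCKER="${DOCKER:-echo docker}"

# Hypothetical helper: report disk usage, then remove stopped containers
# and dangling images without asking for confirmation.
docker_cleanup() {
    $DOCKER system df                  # disk space used by images, containers, volumes
    $DOCKER container prune --force    # remove all stopped containers
    $DOCKER image prune --force        # remove dangling (untagged) images
}

docker_cleanup
```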

Outlook: Docker in OCR-D phase III

Here's a very brief summary of changes to the Docker strategy in OCR-D that we will focus on in the coming weeks and months:

Improve documentation related to Docker on https://ocr-d.de
Properly usable CUDA images: We currently only support CUDA 11, which is compatible neither with all CUDA host toolkits nor with all TensorFlow versions.
Introduce container composition, e.g. with docker-compose, once the OCR-D Web API and its implementations mature.
Phase out slim images in most cases: ocrd_all should be the one-stop shop for OCR-D Docker images.
