-
Notifications
You must be signed in to change notification settings - Fork 7
Intro ocrd workspace CLI
Once you have set up OCR-D you can start exploring the command
line tools. One particular versatile command is ocrd workspace
which will
be introduced in this article.
All OCR-D processors and all other command line tools produced in
OCR-D support the --help
flag. If you are unsure what a command does
or how it is invoked, try <command> --help
, e.g. for ocrd workspace
:
$ ocrd workspace --help
Usage: ocrd workspace [OPTIONS] COMMAND [ARGS]...
Working with workspace
Options:
-d, --directory WORKSPACE_DIR Changes the workspace folder location.
[default: .]
-M, --mets-basename TEXT The basename of the METS file. [default:
mets.xml]
--backup Backup mets.xml whenever it is saved.
--help Show this message and exit.
Commands:
add Add a file or http(s) URL FNAME to METS in a workspace.
backup Backing and restoring workspaces - dev edition
bulk-add Add files in bulk to an OCR-D workspace.
clone Create a workspace from METS_URL and return the directory...
find Find files.
get-id Get METS id if any
init Create a workspace with an empty METS file in DIRECTORY.
list-group List fileGrp USE attributes
list-page List physical page IDs
prune-files Removes mets:files that point to non-existing local files...
remove Delete files (given by their ID attribute ``ID``).
remove-group Delete fileGrps (given by their USE attribute ``GROUP``).
set-id Set METS ID.
validate Validate a workspace METS_URL can be a URL, an absolute
path...
A workspace in OCR-D's terminology is a directory with a METS file, referencing
PAGE-XML and images. Processors work on a workspace, you can create (init
) a new
workspace from scratch or download (clone
) an existing METS file.
If you only have a directory of digitized images and no METS, you can create
one with ocrd workspace init
.
- Create a new mets.xml in the current directory:
ocrd workspace init
- Create a new mets.xml at /path/to/ws1:
ocrd workspace --directory /path/to/ws1 init
To create a local workspace from a remote METS document, use the ocrd workspace clone
command:
- Create a workspace in the current directory from a remote METS URL:
ocrd workspace clone 'https://remote.host/mets.xml'
- Create a workspace at
/tmp/ws
from a remote METS URL:
ocrd workspace --directory /tmp/ws clone https...
- Create a workspace at
/tmp/ws
from a remote URL, also downloading all files referenced in the METS:
ocrd workspace --directory /tmp/ws clone --download https...
Once you have a workspace, i.e. a mets.xml
file in place, you can use the
command line tools to do all kinds of operations on the workspace. Here we'll
focus on how to add files to the workspace.
Files can be added with the ocrd workspace add
command. When adding a file,
there are a number of required parameters to describe the file, as per its help:
ocrd workspace add --help
Usage: ocrd workspace add [OPTIONS] FNAME
Add a file or http(s) URL FNAME to METS in a workspace. If FNAME is not an
http(s) URL and is not a workspace-local existing file, try to copy to
workspace.
Options:
-G, --file-grp TEXT fileGrp USE [required]
-i, --file-id TEXT ID for the file [required]
-m, --mimetype TEXT Media type of the file [required]
-g, --page-id TEXT ID of the physical page
-C, --check-file-exists Whether to ensure FNAME exists
--ignore Do not check whether file exists.
--force If file with ID already exists, replace it. No
effect if --ignore is set.
--help Show this message and exit.
The -G
, -i
and -m
options are required and we strongly recommend you also
always set the --page-id
for a file to make it clear which page the file
being added represents.
For example, let's assume you have a PNG image at /tmp/IMG_0001
, representing
the first physical page of the work. You can add that to the empty workspace at /tmp/ws
we created before, like so:
ocrd workspace --directory /tmp/ws add -i FILE_0001_PNG -G OCR-D-IMG -m image/png -g PHYS_0001 /tmp/IMG_0001.png
Here's a breakdown of the options
-
-i FILE_0001_PNG
: This must be a unique identifier of a file. Make sure to make it specific enough, e.g. by adding the format (as we're doing here) or with some other consistent naming scheme -
-G OCR-D-IMG
: Add this file to theOCR-D-IMG
file group (or create it if it does not yet exist). The names ofmets:fileGrp
are completely arbitrary. By convention, OCR-D usesOCR-D-IMG
for the unprocessed images in the Ground Truth but other conventions call this file groupMAX
(for MAXimum resolution). -
-m image/png
: The media type or MIME type of the file,image/png
for PNG images,image/tiff
for TIFF, etc. -
-g PHYS_0001
: The ID of the physical page, i.e. themets:div
that groups together all files representing the same page. The identifier is arbitrary as well, though many METS tools chosePHYS_
+ numerical identifier as the naming convention. Note that -i and -g always have to start with a letter After running the command, you will find the file copied into the workspace at<fileGrp>/<ID>.ext
, so in this case atOCR-D-IMG/IMG_0001.png
.
To make sure that the file was added, you can use the ocrd workspace find
command:
ocrd workspace find -k ID -k fileGrp -k mimetype -k pageId
FILE_0001_PNG OCR-D-IMG image/png PHYS_0001
If you need to add many files at once, there is a command ocrd workspace bulk-add
that is orders of magnitude faster than calling ocrd workspace add
for each and every file. See ocrd workspace bulk-add --help
for an example
how to use it and if you encounter problems, as always, ask for support in the
OCR-D chat.
Lastly, we want to look at the ocrd workspace validate
command that makes assertions
on the validity, syntactical correctness and spec compliance
of all parts of a workspace: METS, PAGE-XML and images.
Initially developed for producing and validating Ground Truth, it has many checks that can help you find problems in your data
ocrd workspace validate --help
Usage: ocrd workspace validate [OPTIONS] [METS_URL]
Validate a workspace
METS_URL can be a URL, an absolute path or a path relative to $PWD. If not
given, use the concatenation of --directory and --mets-basename.
Check that the METS and its referenced file contents abide by the OCR-D
specifications.
Options:
-a, --download Download all files
-s, --skip [imagefilename|dimension|mets_unique_identifier|mets_file_group_names|mets_files|pixel_density|page|page_xsd|mets_xsd|url]
Tests to skip
--page-textequiv-consistency, --page-strictness [strict|lax|fix|off]
How strict to check PAGE multi-level
textequiv consistency
--page-coordinate-consistency [poly|baseline|both|off]
How fierce to check PAGE multi-level
coordinate consistency
--help Show this message and exit.
If you don't provide any options, the validation will run all checks on your workspace and report any issues, grouped by severity ("notice", "warning", "error") as a simple XML document that is both human-readable and machine-readable.
To skip certain tests that are not helpful for you, use the --skip
option.
For example, if your images have incorrect DPI
metadata, you can omit those
messages with --skip pixel_density
.
Welcome to the OCR-D wiki, a companion to the OCR-D website.
Articles and tutorials
- Running OCR-D on macOS
- Running OCR-D in Windows 10 with Windows Subsystem for Linux
- Running OCR-D on POWER8 (IBM pSeries)
- Running browse-ocrd in a Docker container
- OCR-D Installation on NVIDIA Jetson Nano and Xavier
- Mapping PAGE to ALTO
- Comparison of OCR formats (outdated)
- A Practicioner's View on Binarization
- How to use the bulk-add command to generate workspaces from existing files
- Evaluation of (intermediary) steps of an OCR workflow
- A quickstart guide to ocrd workspace
- Introduction to parameters in OCR-D
- Introduction to OCR-D processors
- Introduction to OCR-D workflows
- Visualizing (intermediate) OCR-D-results
- Guide to updating ocrd workspace calls for 2.15.0+
- Introduction to Docker in OCR-D
- How to import Abbyy-generated ALTO
- How to create ALTO for DFG Viewer
- How to create searchable fulltext data for DFG Viewer
- Setup native CUDA Toolkit for Qurator tools on Ubuntu 18.04
- OCR-D Code Review Guidelines
- OCR-D Recommendations for Using CI in Your Repository
Expert section on OCR-D- workflows
Particular workflow steps
Workflow Guide
- Workflow Guide: preprocessing
- Workflow Guide: binarization
- Workflow Guide: cropping
- Workflow Guide: denoising
- Workflow Guide: deskewing
- Workflow Guide: dewarping
- Workflow Guide: region-segmentation
- Workflow Guide: clipping
- Workflow Guide: line-segmentation
- Workflow Guide: resegmentation
- Workflow Guide: olr-evaluation
- Workflow Guide: text-recognition
- Workflow Guide: text-alignment
- Workflow Guide: post-correction
- Workflow Guide: ocr-evaluation
- Workflow Guide: adaptation-of-coordinates
- Workflow Guide: format-conversion
- Workflow Guide: generic transformations
- Workflow Guide: dummy processing
- Workflow Guide: archiving
- Workflow Guide: recommended workflows