add 1st draft line GT/training specs #105

Open: wants to merge 17 commits into base: master
10 changes: 10 additions & 0 deletions README.md
@@ -3,3 +3,13 @@
# Specification of the technical architecture, interface definitions and data exchange format(s)

See [https://ocr-d.github.io/](https://ocr-d.github.io/).

## Line Ground Truth

* [Spec](./gt-spec.md)
* [BagIt profile](./gt-profile.yml)

## Engine training

* [Spec](./training-spec.md)
* [JSON schema](./training-schema.yml)
1 change: 1 addition & 0 deletions gt-profile.json
@@ -0,0 +1 @@
{"BagIt-Profile-Info":{"BagIt-Profile-Identifier":"https://ocr-d.github.io/gt-profile.json","BagIt-Profile-Version":"1.2.0","Source-Organization":"OCR-D","External-Description":"BagIt profile for OCR line Ground Truth","Contact-Name":"Konstantin Baierer","Contact-Email":"konstantin.baierer@sbb.spk-berlin.de","Version":0.1},"Bag-Info":{"Bagging-Date":{"required":false},"Source-Organization":{"required":false},"Gt-Transcription-Extension":{"required":false,"default":".gt.txt"},"Gt-Transcription-Media-Type":{"required":false,"default":"text/plain"},"Gt-Transcription-Directory":{"required":false,"default":"text"},"Gt-Transcription-Normalization":{"required":false,"default":"NFKC","values":["NFD","NFKD","NFC","NFKC"]},"Gt-Color-Image-Extension":{"required":false,"default":".color.png"},"Gt-Color-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Color-Image-Directory":{"required":false,"default":"img"},"Gt-Grayscale-Image-Extension":{"required":false,"default":".nrm.png"},"Gt-Grayscale-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Grayscale-Image-Directory":{"required":false,"default":"grayscale"},"Gt-Bitonal-Image-Extension":{"required":false,"default":".bin.png"},"Gt-Bitonal-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Bitonal-Image-Directory":{"required":false,"default":"bin"},"Gt-Line-Metadata-Extension":{"required":false,"default":".json"},"Gt-Line-Metadata-Media-Type":{"required":false,"default":"application/json","values":["application/json","text/vnd.yaml"]},"Gt-Line-Metadata-Directory":{"required":false,"default":"meta"},"Gt-Directory":{"required":false,"default":"ground-truth"},"Gt-Directory-Structure":{"required":false,"default":"flat","values":["flat","flat-nested","subfolders","subfolders-nested"]}},"Manifests-Required":["sha512"],"Tag-Manifests-Required":[],"Tag-Files-Required":[],"Tag-Files-Allowed":["README.md"],"Allow-Fetch.txt":false,"Serialization":"allowed","Accept-Serialization":"application/zip","Accept-BagIt-Version":["1.0"]}
106 changes: 106 additions & 0 deletions gt-profile.yml
@@ -0,0 +1,106 @@
BagIt-Profile-Info:
  BagIt-Profile-Identifier: https://ocr-d.github.io/gt-profile.json
  BagIt-Profile-Version: '1.2.0'
  Source-Organization: OCR-D
Contributor

What about information about the origin of the digitized lines?

  • minimal bibliographic record based on DC?
  • and artificially generated lines (+ degeneration)
  • what about the degeneration algorithm?

I think this comment may be in the wrong place. It probably belongs under `## Line metadata`.

Member Author

See https://github.com/OCR-D/spec/pull/105/files/6827085d051e945062203b82ef921e54025cfbda#diff-ee256e83a17cfe309565c88ab376091a for the definition of what is currently supposed to be in there. Bibliographic metadata would be in the METS referred to by metsUrl. I am not sure how to encode provenance at the line level, though. @VolkerHartmann?

  External-Description: BagIt profile for OCR line Ground Truth
  Contact-Name: Konstantin Baierer
  Contact-Email: konstantin.baierer@sbb.spk-berlin.de
  Version: 0.1
Bag-Info:
  Bagging-Date:
    required: false
  Source-Organization:
    required: false
  Gt-Transcription-Extension:
    required: false
    default: '.gt.txt'
  Gt-Transcription-Media-Type:
    required: false
    default: 'text/plain'
  Gt-Transcription-Directory:
    required: false
    default: 'text'
  Gt-Transcription-Normalization:
    required: false
    default: 'NFKC'
    values:
      - NFD
      - NFKD
      - NFC
      - NFKC
  Gt-Color-Image-Extension:
    required: false
    default: '.color.png'
  Gt-Color-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
      - 'image/tiff'
      - 'image/jpeg'
  Gt-Color-Image-Directory:
    required: false
    default: 'img'
  Gt-Grayscale-Image-Extension:
    required: false
    default: '.nrm.png'
  Gt-Grayscale-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
Contributor

Would it make more sense to differentiate between compressed TIFF and JPEG 2000?

Member Author

You mean additionally allow image/jp2? Do engines allow JPEG2000 input for training?

      - 'image/tiff'
      - 'image/jpeg'
  Gt-Grayscale-Image-Directory:
    required: false
    default: 'grayscale'
  Gt-Bitonal-Image-Extension:
    required: false
    default: '.bin.png'
  Gt-Bitonal-Image-Media-Type:
    required: false
    default: 'image/png'
    values:
      - 'image/png'
      - 'image/tiff'
      - 'image/jpeg'
  Gt-Bitonal-Image-Directory:
    required: false
    default: 'bin'
  Gt-Line-Metadata-Extension:
    required: false
    default: '.json'
  Gt-Line-Metadata-Media-Type:
    required: false
    default: 'application/json'
    values:
      - 'application/json'
      - 'text/vnd.yaml'
  Gt-Line-Metadata-Directory:
    required: false
    default: 'meta'
  Gt-Directory:
    required: false
    default: 'ground-truth'
  Gt-Directory-Structure:
    required: false
    default: 'flat'
    values:
      # img and transcription in the Gt-Directory
      - 'flat'
      # img and transcription in the same dir below Gt-Directory
      - 'flat-nested'
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory of Gt-Directory
      - 'subfolders'
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory in the same dir below Gt-Directory
      - 'subfolders-nested'
Manifests-Required: ['sha512']
Tag-Manifests-Required: []
Tag-Files-Required: []
Tag-Files-Allowed:
  - README.md
Allow-Fetch.txt: false
Serialization: allowed
Accept-Serialization: application/zip
Accept-BagIt-Version:
  - '1.0'
196 changes: 196 additions & 0 deletions gt-spec.md
@@ -0,0 +1,196 @@
# linegt

> An exchange format for line-based ground truth for OCR

<!-- BEGIN-MARKDOWN-TOC -->
* [Rationale](#rationale)
* [BagIt](#bagit)
* [BagIt profile](#bagit-profile)
    * [Gt-Transcription-Extension](#gt-transcription-extension)
    * [Gt-Transcription-Media-Type](#gt-transcription-media-type)
    * [Gt-Transcription-Directory](#gt-transcription-directory)
    * [Gt-Transcription-Normalization](#gt-transcription-normalization)
    * [Gt-Grayscale-Image-Extension](#gt-grayscale-image-extension)
    * [Gt-Grayscale-Image-Media-Type](#gt-grayscale-image-media-type)
    * [Gt-Grayscale-Image-Directory](#gt-grayscale-image-directory)
    * [Gt-Color-Image-Extension](#gt-color-image-extension)
    * [Gt-Color-Image-Media-Type](#gt-color-image-media-type)
    * [Gt-Color-Image-Directory](#gt-color-image-directory)
    * [Gt-Bitonal-Image-Extension](#gt-bitonal-image-extension)
    * [Gt-Bitonal-Image-Media-Type](#gt-bitonal-image-media-type)
    * [Gt-Bitonal-Image-Directory](#gt-bitonal-image-directory)
    * [Gt-Line-Metadata-Extension](#gt-line-metadata-extension)
    * [Gt-Line-Metadata-Media-Type](#gt-line-metadata-media-type)
    * [Gt-Line-Metadata-Directory](#gt-line-metadata-directory)
    * [Gt-Directory](#gt-directory)
    * [Gt-Directory-Structure](#gt-directory-structure)
* [Line metadata](#line-metadata)

<!-- END-MARKDOWN-TOC -->

## Rationale

Recent OCR (optical character recognition) engines are no longer
character-based; they rely on neural networks that operate on whole text lines.
These engines can be trained with images of text lines and their transcriptions
("ground truth"), plus engine-specific configuration.

This specification defines a standardized format for bundling such ground truth,
based on the BagIt conventions.

## BagIt

A `linegt` bag must be a valid BagIt bag:

* Root folder must contain a file `bagit.txt`
* Root folder must contain a file `bag-info.txt` with metadata about the bag
* All payload files must be under a folder `/data`
* Every file in `/data` along with its `<algo>` checksum must be listed in a
file `manifest-<algo>.txt`
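The requirements above can be sketched as a small check in Python, using only the standard library. This is a minimal sketch, not a complete BagIt validator: the function name `check_bag_layout` is hypothetical, and it assumes the manifest uses one `<checksum> <path>` entry per line with paths relative to the bag root.

```python
import hashlib
from pathlib import Path


def check_bag_layout(bag_root, algo="sha512"):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper for illustration, not part of any BagIt library.
    """
    root = Path(bag_root)
    problems = []

    # Required tag files in the root folder
    for name in ("bagit.txt", "bag-info.txt"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    # Payload folder and payload manifest
    data_dir = root / "data"
    manifest = root / f"manifest-{algo}.txt"
    if not data_dir.is_dir():
        problems.append("missing data/ folder")
    if not manifest.is_file():
        problems.append(f"missing {manifest.name}")
        return problems

    # Assumption: each manifest line is "<checksum> <path relative to bag root>"
    listed = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(None, 1)
            listed[rel_path.strip()] = checksum.lower()

    # Every payload file must be listed with a matching checksum
    payload = sorted(p for p in data_dir.rglob("*") if p.is_file()) if data_dir.is_dir() else []
    for path in payload:
        rel = path.relative_to(root).as_posix()
        digest = hashlib.new(algo, path.read_bytes()).hexdigest()
        if rel not in listed:
            problems.append(f"not in manifest: {rel}")
        elif listed[rel] != digest:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

Calling `check_bag_layout("/path/to/bag")` returns an empty list when the layout and checksums are consistent. A production setup would more likely rely on an existing BagIt implementation and a profile validator than on a hand-written check like this one.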

## BagIt profile

In addition to the requirements of BagIt, an `ocr_linegt` bag must adhere to
the `ocr_linegt` BagIt profile.

### Gt-Transcription-Extension

Extension of the transcription files. Default: `.gt.txt`.

### Gt-Transcription-Media-Type

Media type of the transcription files. Default: `text/plain`.

### Gt-Transcription-Directory

Name of the subfolder containing transcriptions if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `text`.

### Gt-Transcription-Normalization

Unicode normalization form. One of `NFC`, `NFKC`, `NFD` or `NFKD`. Default: `NFKC`.

![Illustration unicode normalization](http://unicode.org/reports/tr15/images/UAX15-NormFig6.jpg)
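To illustrate how the four forms differ, here is a short Python sketch using the standard `unicodedata` module. The helper name `normalize_transcription` and the sample string are made up for illustration.

```python
import unicodedata


def normalize_transcription(text, form="NFKC"):
    """Normalize a transcription to one of the four Unicode normalization forms."""
    if form not in ("NFC", "NFD", "NFKC", "NFKD"):
        raise ValueError(f"unsupported normalization form: {form}")
    return unicodedata.normalize(form, text)


# "fi" ligature (U+FB01) followed by "e" + combining acute accent (U+0301)
sample = "\ufb01e\u0301"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    code_points = [f"U+{ord(ch):04X}" for ch in normalize_transcription(sample, form)]
    print(form, code_points)
```

The canonical forms (`NFC`, `NFD`) keep the ligature and only compose or decompose the accented letter, while the compatibility forms (`NFKC`, `NFKD`) additionally replace the ligature with the two letters `f` and `i`. This matters for ground truth of historical prints, where ligatures and special glyphs are common.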

### Gt-Grayscale-Image-Extension

Extension of the grayscale image files. Default: `.nrm.png`.

### Gt-Grayscale-Image-Media-Type

Media type of the grayscale image files. Default: `image/png`.

### Gt-Grayscale-Image-Directory

Name of the subfolder containing grayscale images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `grayscale`.

### Gt-Color-Image-Extension

Extension of the color image files. Default: `.color.png`.

### Gt-Color-Image-Media-Type

Media type of the color image files. Default: `image/png`.

### Gt-Color-Image-Directory

Name of the subfolder containing color images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `img`.

### Gt-Bitonal-Image-Extension

Extension of the bitonal image files. Default: `.bin.png`.

### Gt-Bitonal-Image-Media-Type

Media type of the bitonal image files. Default: `image/png`.

### Gt-Bitonal-Image-Directory

Name of the subfolder containing bitonal images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `bin`.

### Gt-Line-Metadata-Extension

Extension of the [line metadata] files. Default: `.json`.

### Gt-Line-Metadata-Media-Type

Media type of the [line metadata] files. Default: `application/json`.

### Gt-Line-Metadata-Directory

Name of the subfolder containing [line metadata] if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `meta`.

### Gt-Directory

Directory below `/data` containing the ground truth. Default: `ground-truth`.

### Gt-Directory-Structure

Directory structure. One of

- `flat`: images and transcriptions in the [`Gt-Directory`]
- `flat-nested`: images and transcriptions in the same directory below [`Gt-Directory`]
- `subfolders`: images and transcriptions in subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] of [`Gt-Directory`]
- `subfolders-nested`: images and transcriptions in subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] in the same directory below [`Gt-Directory`]
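The four layouts can be sketched as a path-building function in Python. This is one possible reading of the list above, not a normative definition: the function `line_paths` and its `group` parameter are hypothetical, and the sketch assumes that the `-nested` variants add exactly one extra directory level below `Gt-Directory`. The defaults are the ones listed in the BagIt profile.

```python
from pathlib import PurePosixPath

# Defaults as listed in the BagIt profile (gt-profile.yml)
PROFILE_DEFAULTS = {
    "Gt-Directory": "ground-truth",
    "Gt-Directory-Structure": "flat",
    "Gt-Bitonal-Image-Directory": "bin",
    "Gt-Bitonal-Image-Extension": ".bin.png",
    "Gt-Transcription-Directory": "text",
    "Gt-Transcription-Extension": ".gt.txt",
}
STRUCTURES = ("flat", "flat-nested", "subfolders", "subfolders-nested")


def line_paths(line_name, bag_info=None, group=None):
    """Return (image_path, transcription_path) relative to the bag root.

    `group` is a hypothetical name for the extra directory level that the
    "-nested" structures are assumed to add below Gt-Directory.
    """
    info = {**PROFILE_DEFAULTS, **(bag_info or {})}
    structure = info["Gt-Directory-Structure"]
    if structure not in STRUCTURES:
        raise ValueError(f"unknown Gt-Directory-Structure: {structure}")

    base = PurePosixPath("data") / info["Gt-Directory"]
    if structure.endswith("-nested"):
        if not group:
            raise ValueError("nested structures need a group directory name")
        base = base / group

    if structure.startswith("subfolders"):
        image_dir = base / info["Gt-Bitonal-Image-Directory"]
        text_dir = base / info["Gt-Transcription-Directory"]
    else:
        image_dir = text_dir = base

    image = image_dir / (line_name + info["Gt-Bitonal-Image-Extension"])
    text = text_dir / (line_name + info["Gt-Transcription-Extension"])
    return image, text
```

For example, `line_paths("0001")` yields `data/ground-truth/0001.bin.png` and `data/ground-truth/0001.gt.txt`, while `line_paths("0001", {"Gt-Directory-Structure": "subfolders"})` places the same files under `data/ground-truth/bin/` and `data/ground-truth/text/`.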

## Line metadata

In addition to the bag-wide metadata defined by the [BagIt profile], metadata
can be saved per line to preserve the provenance of every single line.

Line metadata can be encoded in JSON or YAML (depending on
[`Gt-Line-Metadata-Extension`] and [`Gt-Line-Metadata-Media-Type`]).

Line metadata must adhere to this JSON schema:

<!-- BEGIN-EVAL -w '```yaml' '```' -- cat single-line.yml -->
```yaml
description: Schema for provenance of single lines
type: object
required:
  - coords
  - imageUrl
properties:
  coords:
    description: Coordinates as array of x-y-pairs
    type: array
    items:
      type: array
      minItems: 2
      maxItems: 2
      items:
        type: number
  pageUrl:
    description: URL of the page (resp. URL of the PAGE-XML file)
    type: string
  imageUrl:
    description: URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)
    type: string
  bagUrl:
    description: URL of the bag that contains the page
    type: string
  metsUrl:
    description: URL of the METS document that contains the page
    type: string
  lineId:
    description: ID of the line within the PAGE-XML doc
    type: string
  xpath:
    description: XPath to the line if no `lineId` was provided
    type: string
```

<!-- END-EVAL -->
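As an illustration of what a conforming record might look like, the following Python sketch checks a line metadata record by hand, without a JSON Schema library. The function `validate_line_metadata` is hypothetical, the example URL and line ID are invented, and the check assumes that each entry in `coords` is meant to be exactly one x-y pair.

```python
import json
from numbers import Number

REQUIRED = ("coords", "imageUrl")
STRING_FIELDS = ("pageUrl", "imageUrl", "bagUrl", "metsUrl", "lineId", "xpath")


def is_number(value):
    # bool is a subclass of int in Python but is not a JSON number
    return isinstance(value, Number) and not isinstance(value, bool)


def validate_line_metadata(record):
    """Return a list of error messages; an empty list means the record is valid."""
    if not isinstance(record, dict):
        return ["record must be an object"]
    errors = [f"missing required property: {key}" for key in REQUIRED if key not in record]

    if "coords" in record:
        coords = record["coords"]
        pairs_ok = isinstance(coords, list) and all(
            isinstance(pair, list) and len(pair) == 2 and all(is_number(v) for v in pair)
            for pair in coords
        )
        if not pairs_ok:
            errors.append("coords must be an array of [x, y] number pairs")

    for key in STRING_FIELDS:
        if key in record and not isinstance(record[key], str):
            errors.append(f"{key} must be a string")
    return errors


# Invented example record: a rectangular line region on a page image
example = json.loads("""
{
  "coords": [[0, 0], [100, 0], [100, 20], [0, 20]],
  "imageUrl": "https://example.org/page-0001.png",
  "lineId": "line_0001"
}
""")
print(validate_line_metadata(example))
```

In practice a generic JSON Schema validator would be the more robust choice; the hand-written check only makes the constraints explicit.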

<!--
==================================================================
Reference links
==================================================================
--->
[`Gt-Directory`]: #gt-directory
[`Gt-Bitonal-Image-Directory`]: #gt-bitonal-image-directory
[`Gt-Transcription-Directory`]: #gt-transcription-directory
[`Gt-Directory-Structure`]: #gt-directory-structure
[`Gt-Line-Metadata-Directory`]: #gt-line-metadata-directory
[`Gt-Line-Metadata-Extension`]: #gt-line-metadata-extension
[`Gt-Line-Metadata-Media-Type`]: #gt-line-metadata-media-type
[BagIt Profile]: #bagit-profile
[line metadata]: #line-metadata
1 change: 1 addition & 0 deletions single-line.json
@@ -0,0 +1 @@
{"description":"Schema for provenance of single lines","type":"object","required":["coords","imageUrl"],"properties":{"coords":{"description":"Coordinates as array of x-y-pairs","type":"array","items":{"type":"array","minItems":2,"maxItems":2,"items":{"type":"number"}}},"pageUrl":{"description":"URL of the page (resp. URL of the PAGE-XML file)","type":"string"},"imageUrl":{"description":"URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)","type":"string"},"bagUrl":{"description":"URL of the bag that contains the page","type":"string"},"metsUrl":{"description":"URL of the METS document that contains the page","type":"string"},"lineId":{"description":"ID of the line within the PAGE-XML doc","type":"string"},"xpath":{"description":"XPath to the line if no `lineId` was provided","type":"string"}}}
33 changes: 33 additions & 0 deletions single-line.yml
@@ -0,0 +1,33 @@
description: Schema for provenance of single lines
type: object
required:
  - coords
  - imageUrl
properties:
  coords:
    description: Coordinates as array of x-y-pairs
    type: array
    items:
      type: array
      minItems: 2
      maxItems: 2
      items:
        type: number
  pageUrl:
    description: URL of the page (resp. URL of the PAGE-XML file)
    type: string
  imageUrl:
    description: URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)
    type: string
  bagUrl:
    description: URL of the bag that contains the page
    type: string
  metsUrl:
    description: URL of the METS document that contains the page
    type: string
  lineId:
    description: ID of the line within the PAGE-XML doc
    type: string
  xpath:
    description: XPath to the line if no `lineId` was provided
    type: string

1 change: 1 addition & 0 deletions training-schema.json
@@ -0,0 +1 @@
{"$id":"https://ocr-d.github.io/schemas/v1/training-schema.json","type":"object","required":["engineName","engineVersion","groundTruthBag","outputModelFormat"],"properties":{"engineName":{"type":"string","enum":["ocropus","kraken","tesseract","calamari"]},"engineVersion":{"type":"string"},"engineArguments":{"description":"Command line arguments passed to the CLI training tool","type":"array","default":[]},"groundTruthBag":{"description":"A bag of line ground truth adhering to https://ocr-d.github.io/gt-profile.json","type":"string"},"groundTruthGlob":{"description":"Wildcard for matching only a subset of the ground truth files. Make sure to exclude extensions and end in '*'.","type":"string","default":"*"},"outputModelFormat":{"description":"The output format of the model. Note that individual engines only support a single one or a subset of formats.","enum":["application/vnd.ocrd.pronn","application/vnd.ocrd.clstm","application/vnd.ocrd.coreml","application/vnd.ocrd.pyrnn","application/vnd.ocrd.tf+zip"]},"evalRatio":{"description":"Ratio of evaluation vs. training data to divide up ground truth","type":"number","default":0.9},"randomSeed":{"description":"Seed for the random number generator shuffling the ground truth before dividing it into evaluation vs. training data","type":"integer","default":0}}}
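A training configuration that follows the schema above could be resolved as in this Python sketch, which checks the required properties and fills in the schema defaults. The function `resolve_training_config` is hypothetical, and the example values for `engineVersion` and `groundTruthBag` are invented placeholders.

```python
REQUIRED = ("engineName", "engineVersion", "groundTruthBag", "outputModelFormat")
ENGINES = ("ocropus", "kraken", "tesseract", "calamari")

# Defaults copied from the schema
DEFAULTS = {
    "engineArguments": [],
    "groundTruthGlob": "*",
    "evalRatio": 0.9,
    "randomSeed": 0,
}


def resolve_training_config(config):
    """Check required properties and return a copy with schema defaults applied."""
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    if config["engineName"] not in ENGINES:
        raise ValueError(f"unsupported engineName: {config['engineName']}")

    resolved = {**DEFAULTS, **config}
    # Copy the list so the shared default is never mutated by callers
    resolved["engineArguments"] = list(resolved["engineArguments"])
    return resolved


# Invented example values for illustration only
example_config = {
    "engineName": "kraken",
    "engineVersion": "0.0.0",
    "groundTruthBag": "example-gt-bag.zip",
    "outputModelFormat": "application/vnd.ocrd.pyrnn",
}
print(resolve_training_config(example_config))
```

Whether a given engine actually supports the chosen `outputModelFormat` is not checked here, since the schema only notes that each engine supports a subset of the formats.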