add 1st draft line GT/training specs #105
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"BagIt-Profile-Info":{"BagIt-Profile-Identifier":"https://ocr-d.github.io/gt-profile.json","BagIt-Profile-Version":"1.2.0","Source-Organization":"OCR-D","External-Description":"BagIt profile for OCR line Ground Truth","Contact-Name":"Konstantin Baierer","Contact-Email":"konstantin.baierer@sbb.spk-berlin.de","Version":0.1},"Bag-Info":{"Bagging-Date":{"required":false},"Source-Organization":{"required":false},"Gt-Transcription-Extension":{"required":false,"default":".gt.txt"},"Gt-Transcription-Media-Type":{"required":false,"default":"text/plain"},"Gt-Transcription-Directory":{"required":false,"default":"text"},"Gt-Transcription-Normalization":{"required":false,"default":"NFKC","values":["NFD","NFKD","NFC","NFKC"]},"Gt-Color-Image-Extension":{"required":false,"default":".color.png"},"Gt-Color-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Color-Image-Directory":{"required":false,"default":"img"},"Gt-Grayscale-Image-Extension":{"required":false,"default":".nrm.png"},"Gt-Grayscale-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Grayscale-Image-Directory":{"required":false,"default":"grayscale"},"Gt-Bitonal-Image-Extension":{"required":false,"default":".bin.png"},"Gt-Bitonal-Image-Media-Type":{"required":false,"default":"image/png","values":["image/png","image/tiff","image/jpeg"]},"Gt-Bitonal-Image-Directory":{"required":false,"default":"bin"},"Gt-Line-Metadata-Extension":{"required":false,"default":".json"},"Gt-Line-Metadata-Media-Type":{"required":false,"default":"application/json","values":["application/json","text/vnd.yaml"]},"Gt-Line-Metadata-Directory":{"required":false,"default":"meta"},"Gt-Directory":{"required":false,"default":"ground-truth"},"Gt-Directory-Structure":{"required":false,"default":"flat","values":["flat","flat-nested","subfolders","subfolders-nested"]}},"Manifests-Required":["sha512"],"Tag-Manifests-Required":[],"Tag-Files-Required":[],"Tag-Files-Allowed":["README.md"],"Allow-Fetch.txt":false,"Serialization":"allowed","Accept-Serialization":"application/zip","Accept-BagIt-Version":["1.0"]} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
BagIt-Profile-Info: | ||
  BagIt-Profile-Identifier: https://ocr-d.github.io/gt-profile.json | ||
  BagIt-Profile-Version: '1.2.0' | ||
  Source-Organization: OCR-D | ||
  External-Description: BagIt profile for OCR line Ground Truth | ||
  Contact-Name: Konstantin Baierer | ||
  Contact-Email: konstantin.baierer@sbb.spk-berlin.de | ||
  Version: 0.1 | ||
Bag-Info: | ||
  Bagging-Date: | ||
    required: false | ||
  Source-Organization: | ||
    required: false | ||
  Gt-Transcription-Extension: | ||
    required: false | ||
    default: '.gt.txt' | ||
  Gt-Transcription-Media-Type: | ||
    required: false | ||
    default: 'text/plain' | ||
  Gt-Transcription-Directory: | ||
    required: false | ||
    default: 'text' | ||
  Gt-Transcription-Normalization: | ||
    required: false | ||
    default: 'NFKC' | ||
    values: | ||
      - NFD | ||
      - NFKD | ||
      - NFC | ||
      - NFKC | ||
  Gt-Color-Image-Extension: | ||
    required: false | ||
    default: '.color.png' | ||
  Gt-Color-Image-Media-Type: | ||
    required: false | ||
    default: 'image/png' | ||
    values: | ||
      - 'image/png' | ||
      - 'image/tiff' | ||
      - 'image/jpeg' | ||
  Gt-Color-Image-Directory: | ||
    required: false | ||
    default: 'img' | ||
  Gt-Grayscale-Image-Extension: | ||
    required: false | ||
    default: '.nrm.png' | ||
  Gt-Grayscale-Image-Media-Type: | ||
    required: false | ||
    default: 'image/png' | ||
    values: | ||
      - 'image/png' | ||
Review comment: Would a differentiation between compressed TIFF and JPEG 2000 make more sense?
Reply: You mean additionally allow
      - 'image/tiff' | ||
      - 'image/jpeg' | ||
  Gt-Grayscale-Image-Directory: | ||
    required: false | ||
    default: 'grayscale' | ||
  Gt-Bitonal-Image-Extension: | ||
    required: false | ||
    default: '.bin.png' | ||
  Gt-Bitonal-Image-Media-Type: | ||
    required: false | ||
    default: 'image/png' | ||
    values: | ||
      - 'image/png' | ||
      - 'image/tiff' | ||
      - 'image/jpeg' | ||
  Gt-Bitonal-Image-Directory: | ||
    required: false | ||
    default: 'bin' | ||
  Gt-Line-Metadata-Extension: | ||
    required: false | ||
    default: '.json' | ||
  Gt-Line-Metadata-Media-Type: | ||
    required: false | ||
    default: 'application/json' | ||
    values: | ||
      - 'application/json' | ||
      - 'text/vnd.yaml' | ||
  Gt-Line-Metadata-Directory: | ||
    required: false | ||
    default: 'meta' | ||
  Gt-Directory: | ||
    required: false | ||
    default: 'ground-truth' | ||
  Gt-Directory-Structure: | ||
    required: false | ||
    default: 'flat' | ||
    values: | ||
      # img and transcription in the Gt-Directory | ||
      - 'flat' | ||
      # img and transcription in the same dir below Gt-Directory | ||
      - 'flat-nested' | ||
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory of Gt-Directory | ||
      - 'subfolders' | ||
      # img and transcription in subfolders Gt-Bitonal-Image-Directory and Gt-Transcription-Directory in the same dir below Gt-Directory | ||
      - 'subfolders-nested' | ||
Manifests-Required: ['sha512'] | ||
Tag-Manifests-Required: [] | ||
Tag-Files-Required: [] | ||
Tag-Files-Allowed: | ||
  - README.md | ||
Allow-Fetch.txt: false | ||
Serialization: allowed | ||
Accept-Serialization: application/zip | ||
Accept-BagIt-Version: | ||
  - '1.0' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
# linegt | ||
|
||
> An exchange format for line-based ground truth for OCR | ||
|
||
<!-- BEGIN-MARKDOWN-TOC --> | ||
* [Rationale](#rationale) | ||
* [BagIt](#bagit) | ||
* [BagIt profile](#bagit-profile) | ||
    * [Gt-Transcription-Extension](#gt-transcription-extension) | ||
    * [Gt-Transcription-Media-Type](#gt-transcription-media-type) | ||
    * [Gt-Transcription-Directory](#gt-transcription-directory) | ||
    * [Gt-Transcription-Normalization](#gt-transcription-normalization) | ||
    * [Gt-Grayscale-Image-Extension](#gt-grayscale-image-extension) | ||
    * [Gt-Grayscale-Image-Media-Type](#gt-grayscale-image-media-type) | ||
    * [Gt-Grayscale-Image-Directory](#gt-grayscale-image-directory) | ||
    * [Gt-Color-Image-Extension](#gt-color-image-extension) | ||
    * [Gt-Color-Image-Media-Type](#gt-color-image-media-type) | ||
    * [Gt-Color-Image-Directory](#gt-color-image-directory) | ||
    * [Gt-Bitonal-Image-Extension](#gt-bitonal-image-extension) | ||
    * [Gt-Bitonal-Image-Media-Type](#gt-bitonal-image-media-type) | ||
    * [Gt-Bitonal-Image-Directory](#gt-bitonal-image-directory) | ||
    * [Gt-Line-Metadata-Extension](#gt-line-metadata-extension) | ||
    * [Gt-Line-Metadata-Media-Type](#gt-line-metadata-media-type) | ||
    * [Gt-Line-Metadata-Directory](#gt-line-metadata-directory) | ||
    * [Gt-Directory](#gt-directory) | ||
    * [Gt-Directory-Structure](#gt-directory-structure) | ||
* [Line metadata](#line-metadata) | ||
|
||
<!-- END-MARKDOWN-TOC --> | ||
|
||
## Rationale | ||
|
||
Recent OCR (optical character recognition) engines are no longer | ||
character-based but rely on neural networks that operate on whole lines. These | ||
engines can be trained with images of text lines and their transcription | ||
("ground truth"), plus engine-specific configurations. | ||
|
||
This specification defines a standardized format for bundling such ground truth, | ||
based on the BagIt conventions. | ||
|
||
## BagIt | ||
|
||
A `linegt` bag must be a valid BagIt bag: | ||
|
||
* Root folder must contain a file `bagit.txt` | ||
* Root folder must contain a file `bag-info.txt` with metadata about the bag | ||
* All payload files must be under a folder `/data` | ||
* Every file in `/data` along with its `<algo>` checksum must be listed in a | ||
file `manifest-<algo>.txt` | ||
|
||
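The four requirements above can be checked mechanically. The following is a minimal sketch, not a full BagIt validator: the function name `check_bag_layout` is hypothetical, and it assumes manifest lines of the form `<checksum> <path relative to the bag root>`.

```python
import hashlib
from pathlib import Path


def check_bag_layout(root: Path, algo: str = "sha512") -> list[str]:
    """Return a list of problems; an empty list means the listed rules hold."""
    problems = []
    # Rules 1 and 2: the two tag files must exist in the root folder.
    for tag_file in ("bagit.txt", "bag-info.txt"):
        if not (root / tag_file).is_file():
            problems.append(f"missing {tag_file}")
    # Rule 3: all payload files live under data/.
    data_dir = root / "data"
    if not data_dir.is_dir():
        problems.append("missing data/")
        return problems
    # Rule 4: every payload file is listed with its checksum.
    manifest = root / f"manifest-{algo}.txt"
    if not manifest.is_file():
        problems.append(f"missing {manifest.name}")
        return problems
    listed = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, path = line.split(maxsplit=1)
            listed[path] = checksum.lower()
    for file in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = file.relative_to(root).as_posix()
        digest = hashlib.new(algo, file.read_bytes()).hexdigest()
        if listed.get(rel) != digest:
            problems.append(f"not listed or checksum mismatch: {rel}")
    return problems
```

Calling `check_bag_layout(Path("my-bag"))` on an unpacked bag returns an empty list when all four rules hold. The sketch does not check the contents of `bagit.txt` or the profile-specific `bag-info.txt` tags.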
## BagIt profile | ||
|
||
In addition to the requirements of BagIt, a `linegt` bag must adhere to | ||
the `linegt` BagIt profile. | ||
|
||
### Gt-Transcription-Extension | ||
|
||
Extension of the transcription files. Default: `.gt.txt`. | ||
|
||
### Gt-Transcription-Media-Type | ||
|
||
Media type of the transcription files. Default: `text/plain`. | ||
|
||
### Gt-Transcription-Directory | ||
|
||
Name of the subfolder containing transcriptions if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `text`. | ||
|
||
### Gt-Transcription-Normalization | ||
|
||
Unicode normalization form. One of `NFC`, `NFKC`, `NFD` or `NFKD`. Default: `NFKC`. | ||
|
||
![Illustration unicode normalization](http://unicode.org/reports/tr15/images/UAX15-NormFig6.jpg) | ||
|
||
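To illustrate what the normalization forms do to a transcription, here is a small sketch using Python's standard `unicodedata` module. The helper name `normalize_transcription` is hypothetical and not part of this specification.

```python
import unicodedata

ALLOWED_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def normalize_transcription(text: str, form: str = "NFKC") -> str:
    """Normalize a transcription line to the given Unicode normalization form."""
    if form not in ALLOWED_FORMS:
        raise ValueError(f"unsupported normalization form: {form}")
    return unicodedata.normalize(form, text)


# Compatibility forms (NFKC/NFKD) fold ligatures and the long s,
# while canonical forms (NFC/NFD) keep them.
print(normalize_transcription("\ufb01"))         # ligature fi -> "fi"
print(normalize_transcription("\u017f"))         # long s -> "s"
print(normalize_transcription("\ufb01", "NFC"))  # ligature is preserved
```

Because NFKC and NFKD replace characters such as the long s (U+017F) and typographic ligatures, ground truth for historical prints that should keep these glyphs may be better served by `NFC` or `NFD`.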
### Gt-Grayscale-Image-Extension | ||
|
||
Extension of the grayscale image files. Default: `.nrm.png`. | ||
|
||
### Gt-Grayscale-Image-Media-Type | ||
|
||
Media type of the grayscale image files. Default: `image/png`. | ||
|
||
### Gt-Grayscale-Image-Directory | ||
|
||
Name of the subfolder containing grayscale images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `grayscale`. | ||
|
||
### Gt-Color-Image-Extension | ||
|
||
Extension of the color image files. Default: `.color.png`. | ||
|
||
### Gt-Color-Image-Media-Type | ||
|
||
Media type of the color image files. Default: `image/png`. | ||
|
||
### Gt-Color-Image-Directory | ||
|
||
Name of the subfolder containing color images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `img`. | ||
|
||
### Gt-Bitonal-Image-Extension | ||
|
||
Extension of the bitonal image files. Default: `.bin.png`. | ||
|
||
### Gt-Bitonal-Image-Media-Type | ||
|
||
Media type of the bitonal image files. Default: `image/png`. | ||
|
||
### Gt-Bitonal-Image-Directory | ||
|
||
Name of the subfolder containing bitonal images if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `bin`. | ||
|
||
### Gt-Line-Metadata-Extension | ||
|
||
Extension of the [line metadata] files. Default: `.json`. | ||
|
||
### Gt-Line-Metadata-Media-Type | ||
|
||
Media type of the [line metadata] files. Default: `application/json`. | ||
|
||
### Gt-Line-Metadata-Directory | ||
|
||
Name of the subfolder containing [line metadata] if [`Gt-Directory-Structure`] is `subfolders` or `subfolders-nested`. Default: `meta`. | ||
|
||
### Gt-Directory | ||
|
||
Directory below `/data` containing the ground truth. Default: `ground-truth`. | ||
|
||
### Gt-Directory-Structure | ||
|
||
Directory structure. One of | ||
|
||
- `flat`: img and transcription in the [`Gt-Directory`] | ||
- `flat-nested`: img and transcription in the same dir below [`Gt-Directory`] | ||
- `subfolders`: img and transcription in subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] of [`Gt-Directory`] | ||
- `subfolders-nested`: img and transcription in subfolders [`Gt-Bitonal-Image-Directory`] and [`Gt-Transcription-Directory`] in the same dir below [`Gt-Directory`] | ||
|
||
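The four layouts can be made concrete with a sketch that computes where the bitonal image and the transcription of one line would live. This is one possible reading of the descriptions above: the function `line_paths` and its `group` parameter (the name of the intermediate directory used by the `-nested` variants) are hypothetical, and the defaults are taken from the profile.

```python
from pathlib import PurePosixPath

STRUCTURES = ("flat", "flat-nested", "subfolders", "subfolders-nested")


def line_paths(line_id, structure="flat", group=None, gt_dir="ground-truth",
               image_dir="bin", text_dir="text",
               image_ext=".bin.png", text_ext=".gt.txt"):
    """Return (image path, transcription path) relative to the bag root."""
    if structure not in STRUCTURES:
        raise ValueError(f"unknown Gt-Directory-Structure: {structure}")
    base = PurePosixPath("data") / gt_dir
    if structure.endswith("-nested"):
        # Assumption: "nested" means one extra directory level below Gt-Directory.
        if group is None:
            raise ValueError("nested structures need a group directory name")
        base = base / group
    if structure.startswith("subfolders"):
        image_base, text_base = base / image_dir, base / text_dir
    else:
        image_base = text_base = base
    return (str(image_base / (line_id + image_ext)),
            str(text_base / (line_id + text_ext)))


for structure in STRUCTURES:
    group = "page1" if structure.endswith("-nested") else None
    print(structure, line_paths("0001", structure, group=group))
```

With `subfolders-nested` and `group="page1"`, for instance, the sketch places the image at `data/ground-truth/page1/bin/0001.bin.png` and the transcription at `data/ground-truth/page1/text/0001.gt.txt`.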
## Line metadata | ||
|
||
In addition to the bag-wide metadata defined by the [BagIt profile], metadata | ||
can be saved per line to preserve the provenance of every single line. | ||
|
||
Line metadata can be encoded in JSON or YAML (depending on | ||
[`Gt-Line-Metadata-Extension`] and [`Gt-Line-Metadata-Media-Type`]). | ||
|
||
Line metadata must adhere to this JSON schema: | ||
|
||
<!-- BEGIN-EVAL -w '```yaml' '```' -- cat single-line.yml --> | ||
```yaml | ||
description: Schema for provenance of single lines | ||
type: object | ||
required: | ||
  - coords | ||
  - imageUrl | ||
properties: | ||
  coords: | ||
    description: Coordinates as array of x-y-pairs | ||
    type: array | ||
    items: | ||
      type: array | ||
      minItems: 2 | ||
      maxItems: 2 | ||
      items: | ||
        type: number | ||
  pageUrl: | ||
    description: URL of the page (resp. URL of the PAGE-XML file) | ||
    type: string | ||
  imageUrl: | ||
    description: URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file) | ||
    type: string | ||
  bagUrl: | ||
    description: URL of the bag that contains the page | ||
    type: string | ||
  metsUrl: | ||
    description: URL of the METS document that contains the page | ||
    type: string | ||
  lineId: | ||
    description: ID of the line within the PAGE-XML doc | ||
    type: string | ||
  xpath: | ||
    description: XPath to the line if no `lineId` was provided | ||
    type: string | ||
``` | ||
|
||
<!-- END-EVAL --> | ||
|
||
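As an illustration of the schema above, the following sketch checks a single line metadata record by hand, using only the standard library. The function `check_line_metadata` and the example URLs are hypothetical; a real implementation would more likely pass the schema to a JSON Schema validator.

```python
import json

REQUIRED = ("coords", "imageUrl")
STRING_FIELDS = ("pageUrl", "imageUrl", "bagUrl", "metsUrl", "lineId", "xpath")


def _is_number(value):
    # bool is a subclass of int in Python but is not a JSON number.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_line_metadata(meta: dict) -> list[str]:
    """Return a list of violations; an empty list means the record looks valid."""
    errors = [f"missing required property: {key}"
              for key in REQUIRED if key not in meta]
    coords = meta.get("coords", [])
    if not isinstance(coords, list):
        errors.append("coords must be an array")
    else:
        for i, point in enumerate(coords):
            # Each entry is meant to be an x-y pair of numbers.
            if not (isinstance(point, list) and len(point) == 2
                    and all(_is_number(v) for v in point)):
                errors.append(f"coords[{i}] must be an [x, y] pair of numbers")
    for key in STRING_FIELDS:
        if key in meta and not isinstance(meta[key], str):
            errors.append(f"{key} must be a string")
    return errors


# Hypothetical record for one line; example.org URLs are placeholders.
example = json.loads("""
{
  "coords": [[10, 20], [410, 20], [410, 60], [10, 60]],
  "imageUrl": "https://example.org/images/page_0001.png",
  "metsUrl": "https://example.org/mets.xml",
  "lineId": "line_0001"
}
""")
print(check_line_metadata(example))
```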
<!-- | ||
================================================================== | ||
Reference links | ||
================================================================== | ||
---> | ||
[`Gt-Directory`]: #gt-directory | ||
[`Gt-Bitonal-Image-Directory`]: #gt-bitonal-image-directory | ||
[`Gt-Transcription-Directory`]: #gt-transcription-directory | ||
[`Gt-Directory-Structure`]: #gt-directory-structure | ||
[`Gt-Line-Metadata-Directory`]: #gt-line-metadata-directory | ||
[`Gt-Line-Metadata-Extension`]: #gt-line-metadata-extension | ||
[`Gt-Line-Metadata-Media-Type`]: #gt-line-metadata-media-type | ||
[BagIt Profile]: #bagit-profile | ||
[line metadata]: #line-metadata |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"description":"Schema for provenance of single lines","type":"object","required":["coords","imageUrl"],"properties":{"coords":{"description":"Coordinates as array of x-y-pairs","type":"array","items":{"type":"array","minItems":2,"maxItems":2,"items":{"type":"number"}}},"pageUrl":{"description":"URL of the page (resp. URL of the PAGE-XML file)","type":"string"},"imageUrl":{"description":"URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file)","type":"string"},"bagUrl":{"description":"URL of the bag that contains the page","type":"string"},"metsUrl":{"description":"URL of the METS document that contains the page","type":"string"},"lineId":{"description":"ID of the line within the PAGE-XML doc","type":"string"},"xpath":{"description":"XPath to the line if no `lineId` was provided","type":"string"}}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
description: Schema for provenance of single lines | ||
type: object | ||
required: | ||
  - coords | ||
  - imageUrl | ||
properties: | ||
  coords: | ||
    description: Coordinates as array of x-y-pairs | ||
    type: array | ||
    items: | ||
      type: array | ||
      minItems: 2 | ||
      maxItems: 2 | ||
      items: | ||
        type: number | ||
  pageUrl: | ||
    description: URL of the page (resp. URL of the PAGE-XML file) | ||
    type: string | ||
  imageUrl: | ||
    description: URL of the image (resp. the `pg:imageFilename` in the PAGE-XML file) | ||
    type: string | ||
  bagUrl: | ||
    description: URL of the bag that contains the page | ||
    type: string | ||
  metsUrl: | ||
    description: URL of the METS document that contains the page | ||
    type: string | ||
  lineId: | ||
    description: ID of the line within the PAGE-XML doc | ||
    type: string | ||
  xpath: | ||
    description: XPath to the line if no `lineId` was provided | ||
    type: string | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"$id":"https://ocr-d.github.io/schemas/v1/training-schema.json","type":"object","required":["engineName","engineVersion","groundTruthBag","outputModelFormat"],"properties":{"engineName":{"type":"string","enum":["ocropus","kraken","tesseract","calamari"]},"engineVersion":{"type":"string"},"engineArguments":{"description":"Command line arguments passed to the CLI training tool","type":"array","default":[]},"groundTruthBag":{"description":"A bag of line ground truth adhering to https://ocr-d.github.io/gt-profile.json","type":"string"},"groundTruthGlob":{"description":"Wildcard for matching only a subset of the ground truth files. Make sure to exclude extensions and end in '*'.","type":"string","default":"*"},"outputModelFormat":{"description":"The output format of the model. Note that individual engines only support a single one or a subset of formats.","enum":["application/vnd.ocrd.pronn","application/vnd.ocrd.clstm","application/vnd.ocrd.coreml","application/vnd.ocrd.pyrnn","application/vnd.ocrd.tf+zip"]},"evalRatio":{"description":"Ratio of evaluation vs. training data to divide up ground truth","type":"number","default":0.9},"randomSeed":{"description":"Seed for the random number generator shuffling the ground truth before dividing it into evaluation vs. training data","type":"integer","default":0}}} |
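The interplay of `groundTruthGlob`, `evalRatio` and `randomSeed` in the training schema can be sketched as follows. The schema wording leaves open which side of the split `evalRatio` measures, so this sketch assumes that the default of 0.9 is the share of lines used for training. The function `split_ground_truth` is hypothetical.

```python
import fnmatch
import random


def split_ground_truth(names, glob="*", eval_ratio=0.9, random_seed=0):
    """Select line names by glob, shuffle them reproducibly, and split them.

    Assumption: eval_ratio is the fraction of lines that goes to training;
    the remainder is held out for evaluation.
    """
    # groundTruthGlob matches names without extension and ends in '*'.
    selected = sorted(n for n in names if fnmatch.fnmatchcase(n, glob))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_seed).shuffle(selected)
    cut = round(len(selected) * eval_ratio)
    return selected[:cut], selected[cut:]


names = [f"line{i:02d}" for i in range(10)] + ["other01"]
train, evaluation = split_ground_truth(names, glob="line*")
print(len(train), len(evaluation))
```

Sorting before shuffling makes the split depend only on the set of names and the seed, not on the order in which files were listed.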
Review comment: What about information about the origin of the digitized lines?
I think this comment may be in the wrong place here. It probably belongs under the "Line metadata" section.
Reply: See https://github.com/OCR-D/spec/pull/105/files/6827085d051e945062203b82ef921e54025cfbda#diff-ee256e83a17cfe309565c88ab376091a That is the definition of what's currently supposed to be in there. Bibliographic metadata would be in the METS referred to by `metsUrl`. How to encode provenance on a line-level I am not sure though. @VolkerHartmann?