structMap[@TYPE=OCR-D-LOGICAL] / FULLDOWNLOAD #154

Closed
wants to merge 16 commits into from
134 changes: 129 additions & 5 deletions mets.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,14 @@
# Requirements on handling METS/PAGE

OCR-D has decided to base its data exchange format on top of [METS](http://www.loc.gov/standards/mets/).
- OCR-D has decided to base its data exchange format on top of [METS](http://www.loc.gov/standards/mets/).

For layout and text recognition results, the primary exchange format is [PAGE](https://github.com/OCR-D/PAGE-XML)
- For layout and text recognition results, the primary exchange format is [PAGE](https://github.com/OCR-D/PAGE-XML)

This document defines a set of conventions and mechanism for using METS.
- This document defines a set of conventions and mechanisms for using METS.
The basis for METS is the METS Application Profile
for digitised media (currently: http://dfg-viewer.de/fileadmin/groups/dfgviewer/METS-Anwendungsprofil_2.3.1.pdf).

Conventions for PAGE are outlined in [a separate document](page)
- Conventions for PAGE are outlined in [a separate document](page).

kba marked this conversation as resolved.
## Pixel density of images must be explicit and high enough

@@ -95,12 +97,19 @@ with the type of manipulation (`BIN-KRAKEN`).
`<mets:fileGrp USE="OCR-D-IMG-DESKEW">` | Deskewed images
`<mets:fileGrp USE="OCR-D-IMG-DESPECK">` | Despeckled images
`<mets:fileGrp USE="OCR-D-IMG-DEWARP">` | Dewarped images
`<mets:fileGrp USE="OCR-D-SEG-REGION">` | Region segmentation
`<mets:fileGrp USE="OCR-D-SEG-REGION">` | Region segmentation
`<mets:fileGrp USE="OCR-D-SEG-LINE">` | Line segmentation
`<mets:fileGrp USE="OCR-D-SEG-WORD">` | Word segmentation
`<mets:fileGrp USE="OCR-D-SEG-GLYPH">` | Glyph segmentation
`<mets:fileGrp USE="OCR-D-OCR-TESS">` | Tesseract OCR
`<mets:fileGrp USE="OCR-D-OCR-ANY">` | AnyOCR
`<mets:fileGrp USE="OCR-D-TEI">` | TEI
`<mets:fileGrp USE="OCR-D-ALTO">` | ALTO
`<mets:fileGrp USE="OCR-D-hOCR">` | hOCR
`<mets:fileGrp USE="OCR-D-HTML">` | HTML
`<mets:fileGrp USE="OCR-D-TXT">` | Text
`<mets:fileGrp USE="OCR-D-COCO">` | [COCO](http://cocodataset.org/#format-data)
`<mets:fileGrp USE="OCR-D-PDF">` | PDF
tboenig marked this conversation as resolved.
`<mets:fileGrp USE="OCR-D-COR-CIS">` | CIS post-correction
`<mets:fileGrp USE="OCR-D-COR-ASV">` | ASV post-correction
`<mets:fileGrp USE="OCR-D-GT-IMG-BIN">` | Black-and-White images ground truth
@@ -131,6 +140,41 @@ PROCESSOR := [A-Z0-9\-]{3,}
`<mets:file ID="OCR-D-IMG_0001">` | The unmanipulated source image
`<mets:file ID="OCR-D-IMG-BIN_0001">` | Black-and-White image

#### Fulldownload

For `mets:file` entries representative of the publication **as a whole**, the `ID` attribute MUST have prefix `FULLDOWNLOAD_`, followed by the file format (`TEI`, `ALTO`, `hOCR`, `HTML`, `TXT`, `COCO`, `PDF`).

These entries SHOULD be referenced in the [structMap](#ocr-d-structmap) by a `mets:fptr` under the top-level `mets:div`, i.e. at `/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div/mets:fptr`.

##### Examples
`<mets:file ID>` | ID of the file for OCR-D
-- | --
`<mets:file ID="FULLDOWNLOAD_TEI" MIMETYPE="application/tei+xml">` | The digitised publication or book in TEI format.
`<mets:file ID="FULLDOWNLOAD_TEI_01" MIMETYPE="application/tei+xml">` | The digitised publication or book in TEI format, version one.
`<mets:file ID="FULLDOWNLOAD_TEI_02" MIMETYPE="application/tei+xml">` | The digitised publication or book in TEI format, version two.
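
The naming rule for document-level files can be checked mechanically. The following is a minimal sketch, not part of the specification: the helper name `parse_fulldownload_id` is hypothetical, and the optional numeric suffix (`_01`, `_02`) is an assumption inferred from the example IDs rather than a stated rule.

```python
import re

# File formats listed in the Fulldownload rule (case-sensitive, e.g. "hOCR").
FULLDOWNLOAD_FORMATS = ("TEI", "ALTO", "hOCR", "HTML", "TXT", "COCO", "PDF")

# Assumption: an optional "_<digits>" version suffix, as seen in the examples.
FULLDOWNLOAD_ID = re.compile(
    r"FULLDOWNLOAD_(%s)(?:_(\d+))?" % "|".join(FULLDOWNLOAD_FORMATS)
)


def parse_fulldownload_id(file_id):
    """Return (format, version) for a FULLDOWNLOAD_ ID, or None if it does not match."""
    match = FULLDOWNLOAD_ID.fullmatch(file_id)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A page-level ID such as `OCR-D-IMG_0001` yields `None`, so the same helper can be used to separate document-level from page-level entries.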

##### Examples with `fileGrp`

```xml
<mets:fileGrp USE="DOWNLOAD">
  <mets:file ID="FULLDOWNLOAD_TEI" MIMETYPE="application/tei+xml">...</mets:file>
  <mets:file ID="FULLDOWNLOAD_PDF" MIMETYPE="application/pdf">...</mets:file>
</mets:fileGrp>

<mets:fileGrp USE="OCR-D-IMG">
  <mets:file ID="OCR-D-IMG_0001">...</mets:file>
</mets:fileGrp>

<mets:structMap TYPE="PHYSICAL">
  <mets:div ID="PHYS_0000" TYPE="physSequence">
    <mets:fptr FILEID="FULLDOWNLOAD_TEI"/>
    <mets:fptr FILEID="FULLDOWNLOAD_PDF"/>
    <mets:div ID="PHYS_0001" TYPE="page">
      <mets:fptr FILEID="OCR-D-IMG_0001"/>
    </mets:div>
  </mets:div>
</mets:structMap>
```
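
Whether every document-level file is actually referenced from the physical structMap can be verified with a few lines of Python. This is a sketch under assumptions: the function name `unreferenced_fulldownloads` and the sample document are made up for illustration, and the lookup expects the `mets:fptr` elements directly under the top-level `mets:div`, as in the example above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Illustrative document: the PDF is declared but not referenced.
FULLDOWNLOAD_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="DOWNLOAD">
      <mets:file ID="FULLDOWNLOAD_TEI" MIMETYPE="application/tei+xml"/>
      <mets:file ID="FULLDOWNLOAD_PDF" MIMETYPE="application/pdf"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:fptr FILEID="FULLDOWNLOAD_TEI"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def unreferenced_fulldownloads(mets_root):
    """List FULLDOWNLOAD_ file IDs that have no mets:fptr under the top-level physical div."""
    declared = {
        f.get("ID")
        for f in mets_root.iterfind(".//mets:file", NS)
        if (f.get("ID") or "").startswith("FULLDOWNLOAD_")
    }
    referenced = {
        fptr.get("FILEID")
        for fptr in mets_root.iterfind(
            "mets:structMap[@TYPE='PHYSICAL']/mets:div/mets:fptr", NS
        )
    }
    return sorted(declared - referenced)
```

Since the reference is only a SHOULD, a non-empty result is better treated as a warning than as a validation error.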

## Grouping files by page

Every METS file MUST have exactly one physical map that contains a single
@@ -160,6 +204,86 @@ encodings of the same page.
</mets:structMap>
```
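
To illustrate how the physical map groups files by page, the sketch below collects the `FILEID`s found under each page-level `mets:div`. It is not prescribed by this document: `files_by_page` and the sample data are hypothetical, and only `mets:div` elements with `TYPE="page"` are considered.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Illustrative physical map with two pages.
PAGES_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page">
        <mets:fptr FILEID="OCR-D-IMG_0001"/>
        <mets:fptr FILEID="OCR-D-OCR_0001"/>
      </mets:div>
      <mets:div ID="PHYS_0002" TYPE="page">
        <mets:fptr FILEID="OCR-D-IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def files_by_page(mets_root):
    """Map each page-level mets:div ID to the FILEIDs of its direct mets:fptr children."""
    pages = {}
    physical = mets_root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    if physical is None:
        return pages
    for div in physical.iterfind(".//mets:div[@TYPE='page']", NS):
        pages[div.get("ID")] = [
            fptr.get("FILEID") for fptr in div.iterfind("mets:fptr", NS)
        ]
    return pages
```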

## OCR-D structMap
Collaborator

It does not make sense to put this section past Grouping files by page – the latter should be integrated into the former as a subsection!

Contributor

Here is my proposal for a new document structure:

Requirements on handling METS/PAGE

  1. Metadata
    1.1 Unique @ID for the document processed

  2. Images
    2.1. Pixel density of images must be explicit and high enough
    2.2. No multi-page images
    2.3 Image coordinates
    2.4 If in PAGE then in METS

  3. File Group mets:fileGrp
    3.1 @USE syntax
    Examples

  4. File mets:file
    4.1 @ID syntax
    Examples
    4.2 @MIMETYPE syntax
    Examples
    Examples (Media Type for PAGE XML)

  5. Grouping files by page mets:structMap
    Example
    5.1 @TYPE syntax
    Example

  6. Range of pages mets:structLink
    Example

  7. Paths
    7.1 Always use URL or relative filenames
    Example

  8. Recording processing information in METS

Collaborator

In that case, I would:

  • subsume 8 (processing information) under 1 (metadata)
  • abandon 2.3 (frankly, I don't know why this resides here and not just in PAGE.md)
  • replace 2.3 with a general note about original/derived images (what is now in PAGE.md, but including new language from Alternative image same folder #164)

But I wonder: where in that outline did Fulldownload go? Is it still subsumed under 4.1 for you? (We discussed this elsewhere: then you cannot make these subsections self-contained.)

Contributor

> • abandon 2.3 (frankly, I don't know why this resides here and not just in PAGE.md)
> • replace 2.3 with a general note about original/derived images (what is now in PAGE.md, but including new language from Alternative image same folder #164)

That's right, I think 2.3 is better placed in page.md.

> • subsume 8 (processing information) under 1 (metadata)

That's a good proposal.

> But I wonder: where in that outline did Fulldownload go? Is it still subsumed under 4.1 for you? (We discussed this elsewhere: then you cannot make these subsections self-contained.)

Yes, Fulldownload is a section under 4.1.

Contributor

@tboenig Jun 18, 2020

Requirements on handling METS/PAGE

1 Metadata
1.1 Recording processing information in METS
1.2 Unique @ID for the document processed

2 Images
2.1. Pixel density of images must be explicit and high enough
2.2. No multi-page images
2.3 If in PAGE then in METS

3 File Group mets:fileGrp
3.1 @USE syntax
Examples
3.2 @USE="FULLDOWNLOAD_..."
Examples

4 File mets:file
4.1 @ID syntax
Examples
4.2 @MIMETYPE syntax
Examples
Examples (Media Type for PAGE XML)

5 Grouping files by page mets:structMap
Example
5.1 @TYPE syntax
Example

6 Range of pages mets:structLink
Example

7 Paths
7.1 Always use URL or relative filenames
Example

Member Author

Is this resolved?

Collaborator

> Is this resolved?

No, AFAICS it is not. The section Fulldownload should still be part of the file ID syntax (differentiating between page-local and document-global naming scheme, but not trying to formulate this "self-contained"). Also, the section about grouping files by structMap should come below fileGrp and file ID sections.

Collaborator

(Or did you want to do all that in a separate PR, or just wait for the merge with master?)

Collaborator

> I think also 7.1 could be 1.3 instead.

Yes!


A METS file may contain several `mets:structMap` entries, differentiated by their `TYPE` attribute (e.g. `LOGICAL`, `PHYSICAL`, ...).
* A `mets:structMap` with `TYPE="PHYSICAL"` is mandatory.
* The _logical_ document structure recorded by a _library or archive_ MUST be described by `TYPE="LOGICAL"`.
* The _logical_ document structure detected by _OCR-D software_ MUST be described by `TYPE="OCR-D-LOGICAL"`.
The _logical_ document structure detected by OCR-D software MUST be described by `TYPE="OCR-D-LOGICAL"`.
kba marked this conversation as resolved.

attributes in `structMap` | description
-- | --
`LABEL` | contains the recognized text of a structuring component, e.g. the title of a chapter
`TYPE` | contains the type of a structuring component according to some standardized, controlled vocabulary (see [DFG-Viewer: structural data set](https://dfg-viewer.de/strukturdatenset/)), e.g. `chapter`
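
As a rough illustration of how the different `mets:structMap` entries and their `TYPE`/`LABEL` attributes might be read, the sketch below lists the structMap types present and flattens one of them into an outline. The function names `structmap_types` and `outline` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Illustrative document with an OCR-D logical map and an (empty) physical map.
STRUCTMAP_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="OCR-D-LOGICAL">
    <mets:div ID="OCR-D-loc_0001">
      <mets:div ID="OCR-D-loc_d5e320" TYPE="chapter" LABEL="Kapitel 1">
        <mets:div ID="OCR-D-loc_d7e560" TYPE="chapter" LABEL="Unterkapitel"/>
      </mets:div>
      <mets:div ID="OCR-D-loc_d9e376" TYPE="chapter" LABEL="Kapitel 2"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence"/>
  </mets:structMap>
</mets:mets>"""


def structmap_types(mets_root):
    """Return the TYPE attribute of every mets:structMap, in document order."""
    return [sm.get("TYPE") for sm in mets_root.iterfind("mets:structMap", NS)]


def outline(mets_root, structmap_type):
    """Return (depth, TYPE, LABEL) for every mets:div of the chosen structMap."""
    rows = []

    def walk(div, depth):
        rows.append((depth, div.get("TYPE"), div.get("LABEL")))
        for child in div.iterfind("mets:div", NS):
            walk(child, depth + 1)

    for sm in mets_root.iterfind("mets:structMap", NS):
        if sm.get("TYPE") == structmap_type:
            for top in sm.iterfind("mets:div", NS):
                walk(top, 0)
    return rows
```

Checking `"PHYSICAL" in structmap_types(root)` is one simple way to test the mandatory physical map.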

### `mets:structLink`

The `mets:structLink` element describes the range of pages covered by each part of the document.


### Example

```xml
<mets:fileGrp USE="OCR-D-IMG">
  <mets:file ID="OCR-D-IMG_0001">...</mets:file>
</mets:fileGrp>
<mets:fileGrp USE="OCR-D-OCR">
  <mets:file ID="OCR-D-OCR_0001">...</mets:file>
</mets:fileGrp>
<mets:structMap TYPE="OCR-D-LOGICAL">
  <mets:div DMDID="dmdSec_0001" ADMID="amdSec_0001" ID="OCR-D-loc_0001">
    <mets:div ID="OCR-D-loc_d5e320" TYPE="chapter" LABEL="KapıteI 1">
      <mets:div ID="OCR-D-loc_d7e560" TYPE="chapter" LABEL="Unterkapitel"/>
    </mets:div>
    <mets:div ID="OCR-D-loc_d9e376" TYPE="chapter" LABEL="Kapidel 2"/>
  </mets:div>
</mets:structMap>
<mets:structMap TYPE="LOGICAL">
  <mets:div TYPE="Monograph" DMDID="dmdSec_0001" ADMID="amdSec_0001" ID="loc_0001">
    <mets:div ID="loc_d1e410" TYPE="chapter" LABEL="Kapitel 1"/>
    <mets:div ID="loc_d1e451" TYPE="chapter" LABEL="Kapitel 2"/>
  </mets:div>
</mets:structMap>
<mets:structMap TYPE="PHYSICAL">
  <mets:div ID="PHYS_0000" TYPE="physSequence">
    <mets:div ID="PHYS_0001" TYPE="page">
      <mets:fptr FILEID="OCR-D-IMG_0001"/>
      <mets:fptr FILEID="OCR-D-OCR_0001"/>
    </mets:div>
  </mets:div>
</mets:structMap>
<mets:structLink>

  <!-- Library part -->
  <mets:smLink xlink:from="loc_0001" xlink:to="PHYS_0000"/>
  <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0001"/>
  <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0002"/>
  <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0003"/>
  <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0004"/>
  <mets:smLink xlink:from="loc_d1e451" xlink:to="PHYS_0005"/>
  <mets:smLink xlink:from="loc_d1e451" xlink:to="PHYS_0006"/>

  <!-- OCR-D part -->
  <mets:smLink xlink:from="OCR-D-loc_0001" xlink:to="PHYS_0000"/>
  <!-- Kapitel 1 -->
  <mets:smLink xlink:from="OCR-D-loc_d5e320" xlink:to="PHYS_0001"/>
  <mets:smLink xlink:from="OCR-D-loc_d5e320" xlink:to="PHYS_0002"/>
  <mets:smLink xlink:from="OCR-D-loc_d5e320" xlink:to="PHYS_0003"/>
  <mets:smLink xlink:from="OCR-D-loc_d5e320" xlink:to="PHYS_0004"/>

  <!-- Unterkapitel (subchapter of Kapitel 1) -->
  <mets:smLink xlink:from="OCR-D-loc_d7e560" xlink:to="PHYS_0002"/>
  <mets:smLink xlink:from="OCR-D-loc_d7e560" xlink:to="PHYS_0003"/>
  <mets:smLink xlink:from="OCR-D-loc_d7e560" xlink:to="PHYS_0004"/>

  <!-- Kapitel 2 -->
  <mets:smLink xlink:from="OCR-D-loc_d9e376" xlink:to="PHYS_0005"/>
  <mets:smLink xlink:from="OCR-D-loc_d9e376" xlink:to="PHYS_0006"/>

</mets:structLink>
```
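
The page range of each logical `mets:div` can be recovered by grouping the `mets:smLink` entries by their `xlink:from` value. The following sketch assumes the standard METS and XLink namespaces; the function name `pages_by_logical_div` and the sample data are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"

# Illustrative structLink with two logical divisions.
LINK_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structLink>
    <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="loc_d1e410" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="loc_d1e451" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>"""


def pages_by_logical_div(mets_root):
    """Map each xlink:from ID to the list of xlink:to IDs, in document order."""
    pages = defaultdict(list)
    for link in mets_root.iterfind(f"{{{METS}}}structLink/{{{METS}}}smLink"):
        pages[link.get(f"{{{XLINK}}}from")].append(link.get(f"{{{XLINK}}}to"))
    return dict(pages)
```

Because library links (`loc_...`) and OCR-D links (`OCR-D-loc_...`) share one `mets:structLink`, filtering the resulting keys by prefix is one way to separate the two.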

## Images and coordinates

Coordinates are always absolute, i.e. relative to the extent defined in the `imageWidth`/`imageHeight` attributes of the nearest `<pc:Page>`.
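
As a small illustration, a PAGE `points` string (space-separated `x,y` pairs) can be checked against the page extent. This is only a sketch: the helper names are hypothetical, and treating the upper bounds as inclusive is an assumption, not something this document specifies.

```python
def parse_points(points):
    """Parse a PAGE `points` string such as "10,20 30,40" into (x, y) integer pairs."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def points_within_page(points, image_width, image_height):
    """Check that all points lie inside the page extent.

    Assumption: the bounds 0..imageWidth and 0..imageHeight are inclusive.
    """
    return all(
        0 <= x <= image_width and 0 <= y <= image_height
        for x, y in parse_points(points)
    )
```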