This repository provides a selection of test lines from several early printed books, together with the corresponding individual and mixed OCRopus models.
LatinHist-98000.pyrnn.gz is a mixed OCRopus model trained on twelve Latin books printed with Antiqua types between 1471 and 1686. Ten of the twelve are early works produced before 1600. For details about the books, please see [1]. Training was performed on 8,684 lines, and the best model was chosen by evaluating all resulting models on 2,432 previously unseen test lines. The lowest character error rate (CER) achieved was 2.92%, after 98,000 training steps.
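The model selection described above can be illustrated with a minimal sketch. It assumes CER is the total Levenshtein (edit) distance between OCR output and ground truth, divided by the total number of ground truth characters. The README does not state the exact definition or any text normalization used in [1], so treat this as an approximation. The function names and the model names in the usage example are hypothetical and are not part of OCRopus.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(gt_lines, ocr_lines) -> float:
    """Assumed definition: summed edit distance / summed ground truth length."""
    errors = sum(edit_distance(g, o) for g, o in zip(gt_lines, ocr_lines))
    chars = sum(len(g) for g in gt_lines)
    return errors / chars if chars else 0.0


def pick_best(gt_lines, outputs_by_model):
    """Return the key of the model whose output has the lowest CER."""
    return min(outputs_by_model,
               key=lambda name: cer(gt_lines, outputs_by_model[name]))


if __name__ == "__main__":
    # Hypothetical data: two checkpoints evaluated on one test line.
    gt = ["abcd"]
    outputs = {"model-A": ["abxd"], "model-B": ["abcd"]}
    print(pick_best(gt, outputs))
```

In this sketch `gt_lines` and each list in `outputs_by_model` are expected to be lists of equal length, with line `i` of the OCR output corresponding to line `i` of the ground truth.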
The Books folder contains seven early printed books used for evaluation in [2][3]. For each book we provide 150 lines of ground truth (GT) as well as a strong individual model (coming soon) trained on a large number of lines (at least 1,500). For details about the books, please see [2][3].
For additional GT for Latin, Greek, and German Fraktur, please see the CIS OCR Testset.
All data available in this repository is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please cite (one of) the following publications when using the data.
[1] Springmann et al. (2016). Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings. ArXiv e-prints.
@article{springmann2016automatic,
title = {Automatic quality evaluation and (semi-) automatic
improvement of {OCR} models for historical printings},
author = {Springmann, Uwe and Fink, Florian and Schulz, Klaus U},
journal = {ArXiv e-prints},
url = {https://arxiv.org/abs/1606.05157},
year = {2016}
}
[2] Reul et al. (2017). Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. ArXiv e-prints. Submitted to the 13th IAPR International Workshop on Document Analysis Systems.
@article{reul2017voting,
title = {Improving {OCR} Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting},
author = {Reul, Christian and Springmann, Uwe and Wick, Christoph and Puppe, Frank},
journal = {ArXiv e-prints},
url = {https://arxiv.org/abs/1711.09670},
year = {2017}
}
[3] Reul et al. (2017). Transfer Learning for OCRopus Model Training on Early Printed Books. ArXiv e-prints. Submitted to 027.7 - Journal for Library Culture.
@article{reul2017transfer,
title = {Transfer Learning for OCRopus Model Training on Early Printed Books},
author = {Reul, Christian and Wick, Christoph and Springmann, Uwe and Puppe, Frank},
journal = {ArXiv e-prints},
url = {https://arxiv.org/abs/1712.05586},
year = {2017}
}