A curated list of awesome tools and literature for historical newspaper analysis, including data standardization, optical character recognition, document layout analysis, text enrichment, semantic segmentation, quality evaluation and natural language analysis.
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information.
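As a rough illustration of the layout information hOCR carries: the format embeds OCR results in HTML, with bounding boxes and confidences stored in each element's `title` attribute (for example `bbox 10 20 120 50; x_wconf 93`). The sketch below uses only the Python standard library. The helper names (`parse_title`, `HocrWordParser`, `extract_words`) and the sample markup are made up for illustration and are not part of any tool listed here. It assumes word elements are `span`s with class `ocrx_word` and no nested `span`s.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Parse an hOCR title attribute into a dict (only bbox and x_wconf are handled)."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "bbox" and len(values) == 4:
            props["bbox"] = tuple(int(v) for v in values)
        elif key == "x_wconf" and values:
            props["x_wconf"] = float(values[0])
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": props.get("bbox"),
                "conf": props.get("x_wconf"),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumption: words contain no nested spans, so the next </span> closes the word.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample fragment, not output from a real OCR run.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 93'>Daily</span>
    <span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 71'>Gazette</span>
  </span>
</div>
"""

words = extract_words(SAMPLE)
for word in words:
    print(word["text"], word["bbox"], word["conf"])
```

For real files, a dedicated tool such as hocr-tools or a full HTML/XML parser is likely a better fit than this minimal parser.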
- hocr-tools - Tools for the manipulation of hOCR files and the evaluation of OCR quality.
- hocrjs - Visualization of hOCR files.
- PAGEviewer - Visualization of page layout and OCR segmentation for PAGE XML, ALTO XML, FineReader XML and hOCR.
- Tesseract OCR engine - Open source C++ API and command line tool. Provides basic layout analysis.
- Ocrad - The GNU OCR.
- layout-parser - Detectron2-based layout analysis tool.
- Aletheia - Ground truth annotation tool.
- scikit-learn - Well-documented general purpose ML library.
- TidyText - Manipulation of text data (R package; similar operations are easy to do with pandas).
- LdaSeqModel - Dynamic topic modeling in Python.
- The Impresso project - Text mining 200 years of historical newspapers.
  - Take a look at their app and blog.
  - Take a look at their GitHub Organization.