TessTools is an interface to the Tesseract OCR command line tool (version 4), with parsing functions for the analysis of historical newspaper archives. The package is under development.
Make sure the tesseract command line program is installed and available in PATH. You can either install Tesseract from a pre-built binary package or build it from source. Running it without arguments should print its usage message:
$ tesseract
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
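The same check can be made from within R before running any OCR. This is a minimal sketch using only base R; the helper name `tesseract_available` is hypothetical and is not part of TessTools.

```r
# Hypothetical helper (not part of TessTools): report whether a
# command line program can be found on the PATH.
tesseract_available = function(cmd = "tesseract") {
  # Sys.which() returns an empty string when the program is not found.
  unname(nzchar(Sys.which(cmd)))
}

if (!tesseract_available()) {
  message("tesseract was not found in PATH; install it before running OCR.")
}
```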
You can install the development version of TessTools from GitHub with:
# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")
Download the first issue (1905) of the Duke Chronicle newspaper.
library(TessTools)
issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")
Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.
hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")
#> Running tesseract-OCR on 4 image files.
# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page
bbox1 | bbox2 | bbox3 | bbox4 | text |
---|---|---|---|---|
481 | 1251 | 1099 | 1314 | HESPERIAN VS. COLUMBIAN. |
361 | 1394 | 1225 | 1554 | Sixteenth Annual Inter-Society Debate –Won By the Hesperian. |
424 | 1592 | 822 | 1653 | A great debate! |
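The bounding box columns can be used to derive paragraph dimensions, for example to tell wide headlines from narrow column text. The sketch below assumes that `bbox1` to `bbox4` follow the hOCR order (left, top, right, bottom) in pixels, which is not stated here. It uses a small hand-built data frame copying the bounding boxes of the three rows above rather than real package output.

```r
# Toy data frame copying the bounding boxes of the three rows shown above.
page1 = data.frame(
  bbox1 = c(481, 361, 424),
  bbox2 = c(1251, 1394, 1592),
  bbox3 = c(1099, 1225, 822),
  bbox4 = c(1314, 1554, 1653)
)

# Assumption: the columns are (left, top, right, bottom) pixel coordinates.
page1$width  = page1$bbox3 - page1$bbox1
page1$height = page1$bbox4 - page1$bbox2

page1[, c("width", "height")]
```

Under that assumption, the second paragraph (the subtitle) is the widest and tallest of the three.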
Visualize the result using hocrjs:
webpages = visualize_html(hocrfiles, outputdir="data-raw/html") # webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]]) # Note: bring up the hocrjs menu and select "show background image"
Paragraphs of the first issue have been annotated according to the article to which they belong.
# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]
| | bbox1 | bbox2 | bbox3 | bbox4 | text | articleID | category | note |
|---|---|---|---|---|---|---|---|---|
| 9 | 481 | 1251 | 1099 | 1314 | HESPERIAN VS. COLUMBIAN. | 1 | title | |
| 10 | 361 | 1394 | 1225 | 1554 | Sixteenth Annual Inter-Society Debate —Won By the Hesperian. | 1 | title | |
| 11 | 424 | 1592 | 822 | 1653 | A great debate! | 1 | text | |
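With these annotations, paragraphs can be grouped back into articles. The following is a rough sketch in base R on a hand-built copy of the three rows above, keeping only the `text`, `articleID`, and `category` columns. The names `truth_page1` and `articles` are illustrative and not part of the package.

```r
# Toy copy of rows 9 to 11 of the annotated first page.
truth_page1 = data.frame(
  text = c("HESPERIAN VS. COLUMBIAN.",
           # \u2014 is the dash printed in row 10
           "Sixteenth Annual Inter-Society Debate \u2014Won By the Hesperian.",
           "A great debate!"),
  articleID = c(1, 1, 1),
  category = c("title", "title", "text"),
  stringsAsFactors = FALSE
)

# Join the paragraphs of each article, keeping titles and body text apart.
articles = aggregate(text ~ articleID + category, data = truth_page1,
                     FUN = paste, collapse = " ")
articles
```

Each row of `articles` then holds the concatenated title or body text of one article.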