Duke-Chronicle-Project/TessTools

 
 

TessTools: Tools for the use of Tesseract OCR in R

An R interface to the Tesseract OCR command-line tool (version 4), with parsing functions for the analysis of historical newspaper archives. The package is under development.

Installation

Make sure the tesseract command-line program is installed and available in PATH. You can either install Tesseract from a pre-built binary package or build it from source. Running tesseract with no arguments should print its usage message:

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
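The same check can be done from within R before calling any TessTools function. This is a minimal sketch using only base R; it is not part of TessTools, and the variable names are illustrative.

```r
# Look up the tesseract executable on PATH ("" means it was not found).
tesseract_path <- unname(Sys.which("tesseract"))

if (nzchar(tesseract_path)) {
  # Ask the binary for its version; capture both streams, since the
  # stream used for version output can differ between builds.
  version_info <- system2(tesseract_path, "--version",
                          stdout = TRUE, stderr = TRUE)
  message("Found: ", version_info[1])
} else {
  message("tesseract was not found on PATH; install it before using TessTools.")
}
```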

You can install the development version of TessTools from GitHub with:

# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")

Example

Download the first issue (1905) of the Duke Chronicle newspaper.

library(TessTools)

issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")

Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.

hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")
#> Running tesseract-OCR on 4 image files.

# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page
#>   bbox1 bbox2 bbox3 bbox4 text
#>     481  1251  1099  1314 HESPERIAN VS. COLUMBIAN.
#>     361  1394  1225  1554 Sixteenth Annual Inter-Society Debate –Won By the Hesperian.
#>     424  1592   822  1653 A great debate!
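The bounding boxes make it possible to reason about page layout. The sketch below rebuilds the three rows shown above as a plain data frame and derives box sizes from them. It assumes that bbox1 to bbox4 follow the hOCR convention of left, top, right, bottom pixel coordinates; the README does not state this, so check it against your own output.

```r
# Rows copied from the example output above (first page, paragraphs 9 to 11).
page <- data.frame(
  bbox1 = c(481, 361, 424),
  bbox2 = c(1251, 1394, 1592),
  bbox3 = c(1099, 1225, 822),
  bbox4 = c(1314, 1554, 1653),
  text  = c("HESPERIAN VS. COLUMBIAN.",
            "Sixteenth Annual Inter-Society Debate \u2013Won By the Hesperian.",
            "A great debate!"),
  stringsAsFactors = FALSE
)

# Assumption: (bbox1, bbox2) is the top-left corner and (bbox3, bbox4)
# the bottom-right corner, in pixels.
page$width  <- page$bbox3 - page$bbox1
page$height <- page$bbox4 - page$bbox2

# Sort paragraphs in reading order: top to bottom, then left to right.
page <- page[order(page$bbox2, page$bbox1), ]

print(page[, c("text", "width", "height")])
```

Derived columns such as width and height could serve as simple layout features, for example when trying to separate headlines from body text.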

Visualize the result using hocrjs:

webpages = visualize_html(hocrfiles, outputdir="data-raw/html") # webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]]) # Note: bring up the hocrjs menu and select "show background image"

Ground truth

Paragraphs of the first issue have been annotated with the article they belong to (articleID) and a category such as title or text.

# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]
#>    bbox1 bbox2 bbox3 bbox4 text                                                         articleID category note
#> 9    481  1251  1099  1314 HESPERIAN VS. COLUMBIAN.                                             1 title
#> 10   361  1394  1225  1554 Sixteenth Annual Inter-Society Debate —Won By the Hesperian.         1 title
#> 11   424  1592   822  1653 A great debate!                                                      1 text
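One way to use these annotations is to reassemble paragraphs into articles. The following sketch works on a hand-built copy of the three rows above rather than on vol1_paragraphs_truth itself, so it runs without the package installed. Treating the blank note cells as empty strings is an assumption based on how they print.

```r
# Hand-built copy of the annotated rows shown above.
truth <- data.frame(
  text      = c("HESPERIAN VS. COLUMBIAN.",
                "Sixteenth Annual Inter-Society Debate \u2014Won By the Hesperian.",
                "A great debate!"),
  articleID = c(1, 1, 1),
  category  = c("title", "title", "text"),
  note      = c("", "", ""),  # assumption: blank cells are empty strings
  stringsAsFactors = FALSE
)

# Pull out the headline paragraphs.
titles <- truth$text[truth$category == "title"]

# Group paragraph text by article and join each group into one string.
by_article   <- split(truth$text, truth$articleID)
article_text <- vapply(by_article, paste, character(1), collapse = "\n")

cat(article_text[["1"]], "\n")
```

The same grouping applied to Tesseract output, once paragraphs are assigned to articles, would give article-level text that can be compared against this ground truth.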
