TessTools: Tools for the use of Tesseract OCR in R

Interface to the Tesseract OCR command line tool (version 4) and parsing functions for the analysis of historical newspaper archives. The package is under development.

Installation

Make sure you have the tesseract command line program installed and available on your PATH. You can either install Tesseract from a pre-built binary package or build it from source. Running the command with no arguments should print its usage message:

$ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
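The same check can be made from within R before loading the package. The following is a minimal sketch that uses only base R (`Sys.which` and `system2`); the variable names are illustrative and are not part of TessTools.

```r
# Look up the tesseract executable on the PATH.
# Sys.which() returns an empty string when the program is not found.
tesseract_path <- unname(Sys.which("tesseract"))
tesseract_found <- nzchar(tesseract_path)

if (tesseract_found) {
  # Print the version banner; TessTools is written against Tesseract version 4.
  system2(tesseract_path, "--version")
} else {
  message("tesseract was not found on the PATH; install it before using TessTools.")
}
```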

You can install the development version of TessTools from GitHub with:

# install.packages("devtools")
devtools::install_github("OlivierBinette/TessTools")

Example

Download the first issue (1905) of the Duke Chronicle newspaper.

library(TessTools)

issueID = chronicle_meta[1, "local_id"]
zipfile = download_chronicle(issueID, outputdir="data-raw")

Run Tesseract OCR on the newspaper scans and extract text paragraphs together with their bounding boxes.

hocrfiles = hocr_from_zip(zipfile, outputdir="data-raw/hocr", exdir="data-raw/img")
#> Running tesseract-OCR on 4 image files.

# Extract paragraph text
text = paragraphs(hocrfiles)
text[[1]][9:11, ] # Some paragraphs on the first page

bbox1	bbox2	bbox3	bbox4	text
481	1251	1099	1314	HESPERIAN VS. COLUMBIAN.
361	1394	1225	1554	Sixteenth Annual Inter-Society Debate –Won By the Hesperian.
424	1592	822	1653	A great debate!
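The bounding boxes make simple layout computations possible. The sketch below rebuilds the three rows shown above as a plain data frame (so it runs without Tesseract) and derives the width and height of each paragraph. It assumes that `bbox1` to `bbox4` follow the usual hOCR order of left, top, right, bottom in pixels; the README does not state this explicitly.

```r
# Stand-in for text[[1]][9:11, ], typed in by hand from the output above.
pars <- data.frame(
  bbox1 = c(481, 361, 424),
  bbox2 = c(1251, 1394, 1592),
  bbox3 = c(1099, 1225, 822),
  bbox4 = c(1314, 1554, 1653),
  text = c(
    "HESPERIAN VS. COLUMBIAN.",
    # \u2013 is the en dash that appears in the OCR output
    "Sixteenth Annual Inter-Society Debate \u2013Won By the Hesperian.",
    "A great debate!"
  ),
  stringsAsFactors = FALSE
)

# Assumption: bbox1/bbox2 are the top-left corner, bbox3/bbox4 the bottom-right.
pars$width  <- pars$bbox3 - pars$bbox1
pars$height <- pars$bbox4 - pars$bbox2

# Sort paragraphs from top to bottom of the page by their upper edge.
pars <- pars[order(pars$bbox2), ]
pars[, c("width", "height", "text")]
```

Under the stated assumption, the widths are 618, 864 and 398 pixels and the heights 63, 160 and 61 pixels.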

Visualize the result using hocrjs:

webpages = visualize_html(hocrfiles, outputdir="data-raw/html") # webpage is at data-raw/html/dchnp71001-html
browseURL(webpages[[1]]) # Note: bring up the hocrjs menu and select "show background image"

Ground truth

Paragraphs of the first issue have been annotated with the ID of the article to which they belong and with a category (for example, title or text).

# Ground truth for first page
vol1_paragraphs_truth[[1]][9:11, ]

	bbox1	bbox2	bbox3	bbox4	text	articleID	category
9	481	1251	1099	1314	HESPERIAN VS. COLUMBIAN.	1	title
10	361	1394	1225	1554	Sixteenth Annual Inter-Society Debate —Won By the Hesperian.	1	title
11	424	1592	822	1653	A great debate!	1	text
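One way to use these annotations is to group paragraphs into articles. The sketch below copies the three annotated rows shown above into a data frame and collapses them by `articleID`, separating title paragraphs from body text. It relies only on base R; the `by_article` structure is illustrative and not something TessTools provides.

```r
# Stand-in for vol1_paragraphs_truth[[1]][9:11, ], typed in by hand.
truth <- data.frame(
  bbox1 = c(481, 361, 424),
  bbox2 = c(1251, 1394, 1592),
  bbox3 = c(1099, 1225, 822),
  bbox4 = c(1314, 1554, 1653),
  text = c(
    "HESPERIAN VS. COLUMBIAN.",
    # \u2014 is the dash character used in the annotated text
    "Sixteenth Annual Inter-Society Debate \u2014Won By the Hesperian.",
    "A great debate!"
  ),
  articleID = c(1, 1, 1),
  category = c("title", "title", "text"),
  row.names = 9:11,
  stringsAsFactors = FALSE
)

# Split the paragraphs by article, then join titles and body text separately.
by_article <- lapply(split(truth, truth$articleID), function(a) {
  list(
    title = paste(a$text[a$category == "title"], collapse = " "),
    body  = paste(a$text[a$category == "text"], collapse = "\n")
  )
})

by_article[["1"]]$body
```

For these three rows, article 1 gets a two-paragraph title and the body "A great debate!".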
