Skip to content

jbaiter/hocrviewer-mirador

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HOCRViewer

Demo

Read books in HOCR format with Mirador.

Requirements

  • Python 3.5
  • Optional: An SQLite version that supports FTS5 (check with sqlite3 ":memory:" "PRAGMA compile_options;" |grep FTS5)

Installation

$ pip install -r requirements.txt

Data format

The HOCR file must contain all pages as ocr_page elements. These must have a title attribute that contains the following fields (as per the HOCR Specification):

  • ppageno: The physical page number
  • image: The relative path (from the HOCR file) to the page image
  • bbox: The dimensions of the image

Additionally, each ocr_page element must have an id attribute that assigns a unique identifier to the page.

Example:

<div class="ocr_page" id="page_0005"
     title="ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"/>

Alternatively, HOCR files with accompanying images that are stored like the Google 1000 Books dataset (download instructions) can be indexed and viewed as well.

Usage

Simply point the application to a directory containing hOCR files and it will serve a web interface where you can view them:

$ python hocrviewer.py serve /mnt/data/hocr

You can alternatively index your files before serving them. This has two main advantages: It significantly reduces the response times for the manifests and annotations and it enables the search within the books (not yet usable from Mirador, but keep an eye on this PR).

To do so, run the index subcommand with the path to the directory with your HOCR files as the first argument. By default, the database will be written to ~/.config/hocrviewer/hocrviewer.db, but you can override this with the --db-path option that is passed before the subcommand:

$ python hocrviewer.py --db-path /tmp/test.db index /mnt/data/hocr

After the index has been created, run the application with the serve subcommand (making sure that you pass the same --db-path value as during indexing).

$ python hocrviewer.py --db-path /tmp/test.db serve

The application exposes all books as IIIF manifests at /iiif/<book_name>, where book_name is the file name of the HOCR file for the book without the .html extension.

Planned Features

  • Search across all books (backend done, user interface missing)
  • Edit OCR with a custom AnnotationEditor implementation for Mirador
  • Browse books in a paginated view outside of Mirador (which gets overwhelmed with large libraries)

About

View HOCR files with Mirador

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published