Read books in HOCR format with Mirador.
- Python 3.5
- Optional: An SQLite version that supports FTS5 (check with sqlite3 ":memory:" "PRAGMA compile_options;" | grep FTS5)
$ pip install -r requirements.txt
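The FTS5 check from the requirements can also be run from Python. The following is a minimal sketch, not part of hocrviewer itself; the function name sqlite_supports_fts5 is made up for illustration. It inspects the SQLite library that the Python sqlite3 module is linked against, which may differ from the one used by the sqlite3 command-line tool.

```python
import sqlite3


def sqlite_supports_fts5():
    """Return True if the SQLite library used by Python was compiled with FTS5.

    Hypothetical helper for illustration; it is not part of hocrviewer.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Each row holds one compile-time option, e.g. "ENABLE_FTS5".
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    return any("FTS5" in option for option in options)


if __name__ == "__main__":
    print("FTS5 available:", sqlite_supports_fts5())
```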
The HOCR file must contain all pages as ocr_page elements. These must have a title attribute that contains the following fields (as per the HOCR Specification):
- ppageno: The physical page number
- image: The relative path (from the HOCR file) to the page image
- bbox: The dimensions of the image
Additionally, each ocr_page element must have an id attribute that assigns a unique identifier to the page.
Example:
<div class="ocr_page" id="page_0005"
title="ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"/>
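To illustrate how the title attribute carries these fields, here is a small sketch that splits such a string into ppageno, image and bbox. It is not the parser used by hocrviewer; the function names parse_page_title and page_info are hypothetical, and the sketch assumes the fields are separated by semicolons and that bbox holds four integers (x0 y0 x1 y1).

```python
def parse_page_title(title):
    """Split an ocr_page title attribute into a dict of field name -> raw value."""
    fields = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        # The field name is the first word, the value is everything after it.
        key, _, value = part.partition(" ")
        # Assumption: a quoted image path may occur, so strip surrounding quotes.
        fields[key] = value.strip().strip('"')
    return fields


def page_info(title):
    """Return page number, image path and image size from a title attribute."""
    fields = parse_page_title(title)
    missing = {"ppageno", "image", "bbox"} - set(fields)
    if missing:
        raise ValueError("missing fields: {}".format(", ".join(sorted(missing))))
    x0, y0, x1, y1 = (int(value) for value in fields["bbox"].split())
    return {
        "ppageno": int(fields["ppageno"]),
        "image": fields["image"],
        "width": x1 - x0,
        "height": y1 - y0,
    }


if __name__ == "__main__":
    example = "ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"
    print(page_info(example))
```

Raising an error on missing fields mirrors the requirement above that all three fields must be present on every page.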
Alternatively, HOCR files whose accompanying images are stored in the same layout as the Google 1000 Books dataset (download instructions) can be indexed and viewed as well.
Simply point the application to a directory containing hOCR files and it will serve a web interface where you can view them:
$ python hocrviewer.py serve /mnt/data/hocr
Alternatively, you can index your files before serving them. This has two main advantages: it significantly reduces the response times for the manifests and annotations, and it enables search within the books (not yet usable from Mirador, but keep an eye on this PR).
To do so, run the index subcommand with the path to the directory with your HOCR files as the first argument. By default, the database will be written to ~/.config/hocrviewer/hocrviewer.db, but you can override this with the --db-path option that is passed before the subcommand:
$ python hocrviewer.py --db-path /tmp/test.db index /mnt/data/hocr
After the index has been created, run the application with the serve subcommand (making sure that you pass the same --db-path value as during indexing).
$ python hocrviewer.py --db-path /tmp/test.db serve
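If you script the two steps, a small wrapper can make sure that both invocations receive the same --db-path value and that the option is placed before the subcommand. The following is only a sketch under assumptions: build_command and index_then_serve are hypothetical helpers, and the script is expected to run from the directory that contains hocrviewer.py.

```python
import subprocess
import sys


def build_command(subcommand, db_path=None, hocr_dir=None):
    """Build the argument list for one hocrviewer.py invocation."""
    cmd = [sys.executable, "hocrviewer.py"]
    if db_path is not None:
        # --db-path is a global option, so it goes before the subcommand.
        cmd += ["--db-path", db_path]
    cmd.append(subcommand)
    if hocr_dir is not None:
        cmd.append(hocr_dir)
    return cmd


def index_then_serve(hocr_dir, db_path):
    """Index the directory, then serve from the same database file."""
    subprocess.check_call(build_command("index", db_path, hocr_dir))
    subprocess.check_call(build_command("serve", db_path))


if __name__ == "__main__":
    index_then_serve("/mnt/data/hocr", "/tmp/test.db")
```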
The application exposes all books as IIIF manifests at /iiif/<book_name>, where book_name is the file name of the HOCR file for the book without the .html extension.
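The mapping from a file name to its manifest path can be sketched as follows. The helper manifest_path and the example file name are hypothetical, and percent-encoding the book name is an assumption of this sketch, not documented behaviour of the application.

```python
import os
from urllib.parse import quote


def manifest_path(hocr_file):
    """Derive the /iiif/<book_name> path for a given HOCR file."""
    file_name = os.path.basename(hocr_file)
    book_name, extension = os.path.splitext(file_name)
    if extension != ".html":
        raise ValueError("expected an .html file, got: {}".format(file_name))
    # Assumption: names with special characters are percent-encoded in the URL.
    return "/iiif/" + quote(book_name)


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    print(manifest_path("/mnt/data/hocr/spyri_heidi_1880.html"))
```

Prefix the returned path with the host and port on which the serve subcommand is listening to obtain the full manifest URL.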
- Search across all books (backend done, user interface missing)
- Edit OCR with a custom AnnotationEditor implementation for Mirador
- Browse books in a paginated view outside of Mirador (which gets overwhelmed with large libraries)