How to create searchable Fulltext Data for DFG Viewer
The OCR-D pilot phase at ULB Saxony-Anhalt (2020) aimed at producing searchable fulltext data for use in the DFG Viewer as well as in IIIF viewers. Already digitized materials from the library, specifically historical documents from ULB's own collections, formed the basis for the workflow, as they are readily available via the OAI-PMH API. The goal was a workflow that operates under real use-case conditions rather than a theoretical construction based on assumptions.
The ALTO XML data resulted from a Makefile workflow configuration of the following processors (with their respective parameters):
- ocrd-im6convert
- ocrd-olena-binarize ("impl": "sauvola-ms-split")
- ocrd-anybaseocr-crop
- ocrd-cis-ocropy-deskew ("level-of-operation": "page", "maxskew": 5)
- ocrd-tesserocr-segment-region ("padding": 5, "find_tables": false)
- ocrd-segment-repair ("plausibilize": true, "plausibilize_merge_min_overlap": 0.7)
- ocrd-cis-ocropy-clip
- ocrd-cis-ocropy-segment ("spread": 2.4)
- ocrd-cis-ocropy-dewarp
- ocrd-tesserocr-recognize ("overwrite_segments": true, "model": "gt4hist_5000k+Fraktur+frk+deu")
- ocrd-fileformat-transform
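The sequence above can also be written down as plain data, e.g. to generate task strings for a runner such as `ocrd process`. This is only a sketch under assumptions: the task syntax (short processor names, parameters inlined as JSON via `-p`) follows the `ocrd process` conventions, and the fileGrp wiring between steps is omitted:

```python
import json

# Processor sequence from the workflow above; steps without
# parameters carry an empty dict.
STEPS = [
    ("ocrd-im6convert", {}),
    ("ocrd-olena-binarize", {"impl": "sauvola-ms-split"}),
    ("ocrd-anybaseocr-crop", {}),
    ("ocrd-cis-ocropy-deskew", {"level-of-operation": "page", "maxskew": 5}),
    ("ocrd-tesserocr-segment-region", {"padding": 5, "find_tables": False}),
    ("ocrd-segment-repair", {"plausibilize": True,
                             "plausibilize_merge_min_overlap": 0.7}),
    ("ocrd-cis-ocropy-clip", {}),
    ("ocrd-cis-ocropy-segment", {"spread": 2.4}),
    ("ocrd-cis-ocropy-dewarp", {}),
    ("ocrd-tesserocr-recognize", {"overwrite_segments": True,
                                  "model": "gt4hist_5000k+Fraktur+frk+deu"}),
    ("ocrd-fileformat-transform", {}),
]

def to_tasks(steps):
    """Render each step as an 'ocrd process' style task string:
    short name, parameters inlined as a JSON object."""
    tasks = []
    for name, params in steps:
        task = name.replace("ocrd-", "", 1)
        if params:
            task += " -p '" + json.dumps(params) + "'"
        tasks.append(task)
    return tasks
```

Keeping the workflow as data like this makes it easy to print, log, or vary parameters per run without editing the Makefile itself.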
To enhance runtime performance and stability, each digitized book's pages were processed in up to 12 separate Docker containers based on the Docker image ocrd/all:2020-08-04. Of the roughly 50,000 pages overall, about 500 (around 1%) were dropped due to errors. These errors are almost exclusively related to rather difficult input data (large illustrations, maps, tables, handwritten notes and the like).
During tests, parallelizing the entire workflow proved more error-prone and was therefore dropped in favour of the per-page separation.
Because of this partitioning into pages, all resulting files need to be integrated into the METS/MODS afterwards. This step is not required if the entire work is processed at once, but was made necessary by the separation. It starts by creating an OCR-D workspace from an OAI record (the following snippets were implemented with standard Python 3.6 plus additional functionality from the ocrd core package):
```python
import os

from ocrd.resolver import Resolver


def ocrd_workspace_clone(oai_identifier):
    """Wrap ocrd workspace clone in curdir"""
    # clone workspace from the OAI-PMH record
    mets_url = f"{OAI_URL}{OAI_PARAMS}&identifier={oai_identifier}"
    resolver = Resolver()
    workspace = resolver.workspace_from_url(
        mets_url=mets_url,
        dst_dir='.',
        download=False)
    workspace.save_mets()


def ocrd_workspace_setup(root_dir, sub_dir, file_id):
    """Wrap ocrd workspace init and add single file"""
    # init workspace
    the_dir = os.path.abspath(sub_dir)
    resolver = Resolver()
    workspace = resolver.workspace_from_nothing(
        directory=the_dir)
    workspace.save_mets()
    # add image
    image_src = f"{root_dir}/MAX/{file_id}.jpg"
    resolver.download_to_directory(
        the_dir,
        image_src,
        subdir='MAX')
    kwargs = {
        'fileGrp': 'MAX',
        'ID': 'IMG_' + file_id,
        'mimetype': 'image/jpeg',
        'url': f"MAX/{file_id}.jpg"}
    workspace.mets.add_file(**kwargs)
    workspace.save_mets()
    # proceed with workflow ...
```
The actual execution of the OCR-D container is wrapped via Python to allow scaling:
```python
import os
import subprocess


# case a: run n workspaces in parallel
def run_ocr_workspaces(*args):
    ocr_dir = args[0][0]
    part_by = args[0][1]
    os.chdir(ocr_dir)
    user_id = os.getuid()
    # remove any stale container left over from a previous run
    cmd_clean = f'docker rm --force {CNT_NAME}'
    subprocess.run(cmd_clean, shell=True, check=True)
    cmd = (f'docker run --rm --name {CNT_NAME} -u "{user_id}" -w /data '
           f'-v "{ocr_dir}":/data -v {TESSDIR_HOST}:{TESSDIR_CNT} {IMAGE} '
           f'ocrd-make all -j"{part_by}" -f {MAKEFILE} .')
    try:
        result = subprocess.run(cmd, shell=True, check=True, timeout=7200)
        # ... analyze result
    except subprocess.CalledProcessError as exc:
        pass  # ... handle subprocess failure


# case b: run n containers with a single page each
def run_ocr_page(*args):
    ocr_dir = args[0][0]
    os.chdir(ocr_dir)
    user_id = os.getuid()
    cmd = (f'docker run --rm -u "{user_id}" -w /data '
           f'-v "{ocr_dir}":/data -v {TESSDIR_HOST}:{TESSDIR_CNT} {IMAGE} '
           f'ocrd-make -f {MAKEFILE_PP} .')
    try:
        result = subprocess.run(cmd, shell=True, check=True, timeout=1800)
        # ... analyze result
    except subprocess.CalledProcessError as exc:
        pass  # ... handle subprocess failure
```
To enhance throughput, the calls are executed via Python's standard process pooling in the main script:
```python
import argparse
import concurrent.futures
import os
import sys


def create_ocr(image_path):
    # ... additional setup
    file_id = os.path.basename(image_path).split('.')[0]
    workdir_sub = os.path.join(migration.migration_workdir, file_id)
    try:
        run_ocr_page(workdir_sub)
    except Exception:
        pass  # ... handle further exceptions and results


if __name__ == "__main__":
    APP_ARGUMENTS = argparse.ArgumentParser()
    APP_ARGUMENTS.add_argument(
        "-p",
        "--part_by",
        required=False,
        help="partition size for workflow")
    ARGS = vars(APP_ARGUMENTS.parse_args())
    if ARGS['part_by']:
        PART_BY = int(ARGS['part_by'])
    else:
        PART_BY = 4
    # additional initialization ...
    try:
        with concurrent.futures.ProcessPoolExecutor(max_workers=PART_BY) as executor:
            outcomes = list(executor.map(create_ocr, image_paths))
        # ... analyze outcomes
    except TimeoutError:
        (exc_type, value, traceback) = sys.exc_info()
        migration.the_logger.error(
            "Run into timeout: '%s'(%s): %s",
            value,
            exc_type,
            traceback)
    # further processing ...
```
As indexing is one of the most important preconditions for fulltext search, some adjustments were required. The ALTO files produced did not include a spacing element (`SP`) between the `String` elements of an ALTO `TextLine` element. This led to undesirable effects in the depiction of the fulltext, e.g. spaceless text being rendered in the viewers. To remedy this, a postprocessing step that automatically adds the `SP` elements was implemented.
As adding footers (displaying the URN and the digitizing library) changes the dimensions of the image, these dimensions also need to be adjusted in the ALTO XML to ensure that scaling of the text coordinates still works as expected.
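A minimal sketch of this dimension fix, using the standard library's `xml.etree` instead of `lxml`, with namespaces omitted for brevity; the function name and the `footer_px` parameter are hypothetical:

```python
import xml.etree.ElementTree as ET

def grow_page_height(alto_xml: str, footer_px: int) -> str:
    """Add the footer height to the HEIGHT attribute of the ALTO Page element."""
    root = ET.fromstring(alto_xml)
    page = root.find('.//Page')
    page.set('HEIGHT', str(int(page.get('HEIGHT')) + footer_px))
    return ET.tostring(root, encoding='unicode')
```

With a page of HEIGHT="2395" and a 100 px footer, this yields HEIGHT="2495", matching the before/after example shown further down.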
The DFG Viewer enables fulltext display at line level and, additionally, fulltext search. IIIF viewers are capable of fulltext search if the IIIF manifest includes information on a search API.
The Share_it repository of ULB Saxony-Anhalt uses Solr-based indexing of fulltext data via solr-ocrhighlighting. If the administrative METS section contains the additional information specified for the DFG Viewer as a fulltext SRU to enable fulltext search, the connection of search results to the corresponding images (and even the position within these images) needs to be specified. Therefore each ALTO file needs to be linked with its corresponding image, i.e. the Page element's ID attribute needs to be set to the image's physical name:
```xml
<Layout>
  <Page ID="OCR-D-BINPAGE_0001" HEIGHT="2395" .... >
<!--
  exchange the pre-set ID from the ocrd workflow with the physical filename
  take care of the new image height
-->
<Layout>
  <Page ID="1056985" HEIGHT="2495" .... >
```
This postprocessing can be implemented with help from the `lxml` package:
```python
import lxml.etree as ET


def clear_alto(file_path, n_file, new_height):
    """
    Ensure:
    - each ALTO Page element references the proper image
    - geometrical information not necessary for presentation is dropped
    - an SP element is added between each pair of words
    """
    xml_tree = ET.parse(file_path)
    xml_root = xml_tree.getroot()
    # update page ID and height
    page = xml_root.find('.//alto:Page', XMLNS_POST)
    page.attrib['ID'] = f'p{n_file}'
    page.attrib['HEIGHT'] = new_height
    # remove geometrical elements not used for presentation
    _remove_elements(
        xml_root, ['alto:Shape', 'alto:Illustration', 'alto:GraphicalElement'])
    # add an SP element after every child of the line except the last
    lines = xml_root.findall('.//alto:TextLine', XMLNS_POST)
    for line in lines:
        i = len(line) - 2
        while i >= 0:
            line[i].addnext(ET.XML('<SP />'))
            i = i - 1
    write_xml(xml_tree, file_path)


def _remove_elements(xml_root, tags):
    for tag in tags:
        removals = xml_root.findall(f'.//{tag}', XMLNS_POST)
        for rem in removals:
            parent = rem.getparent()
            parent.remove(rem)
```
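The effect of the SP insertion can be illustrated in isolation with the standard library's `xml.etree` (a simplified, namespace-free sketch assuming a `TextLine` that contains only `String` children; the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

def add_spaces(line: ET.Element) -> None:
    """Insert an SP element between each pair of adjacent String children."""
    # walk backwards so earlier insertions do not shift pending indices
    for i in range(len(line) - 1, 0, -1):
        line.insert(i, ET.Element('SP'))

line = ET.fromstring(
    '<TextLine><String CONTENT="Lorem"/><String CONTENT="ipsum"/>'
    '<String CONTENT="dolor"/></TextLine>')
add_spaces(line)
# tag sequence is now: String, SP, String, SP, String
```

Walking the children backwards is the same trick the `lxml` variant above uses: inserting from the end keeps the indices of the not-yet-processed children stable.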