Image processing
Although Tesseract performs some automated processing internally before character recognition, its output accuracy is generally low for old documents. This project therefore includes tailored image processing modules to improve both image quality and text recognition accuracy. The modules are based on the open-source computer vision library OpenCV, Python built-in functions, NumPy, pandas, and scikit-image. They are used to remove non-textual information (e.g., smear, shadow, noise, faint characters), binarize the document (i.e., improve the contrast between background and text), rotate the image to correct the skew angle, apply morphological operations to segment regions of interest, and so forth. The image processing chain is summarized below and detailed further in the section Detailed processing chain.
Otsu binarization gives excellent results for the Observatories Year Book archive but performs poorly on lower-quality documents such as the Stations of the Second Order. For the latter, dedicated binarization techniques for low-quality documents are described in the literature and show promising results; future work will focus more on these methods.