Recognise Long-s as distinct from short-s #455

Closed
afarlie opened this issue Jun 10, 2020 · 1 comment

Comments

@afarlie

afarlie commented Jun 10, 2020

Is your feature request related to a problem? Please describe.

When performing OCR on older works (pre-19th century), you will encounter typographical conventions that differ from modern printing.

One of these is the use of a long s (ſ) within some words, as opposed to the short (final or terminal) s, which is used at the end of words.

Currently, Tesseract does not detect these as reliably as some proprietary OCR APIs (such as Google's OCR service).

In my own experience of cleaning up OCR output from Tesseract, long s has been misrecognised as l or f, which is time-consuming to check manually in a lengthy run of text. (I appreciate that no OCR is ever going to recognise 'old' text perfectly.)
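To illustrate the manual check described above, here is a minimal sketch of a post-processing pass that flags words where swapping a single `f` or `l` for `s` produces a known word. It is not part of Tesseract or this project. The function name `long_s_candidates` and the tiny `KNOWN_WORDS` list are hypothetical. A real check would need a period-appropriate dictionary, and would need to handle words with more than one long s.

```python
import re

# Hypothetical mini word list for illustration only; a real check would
# load a dictionary suited to the period of the text.
KNOWN_WORDS = {"said", "such", "also", "most", "fast", "life"}


def long_s_candidates(text, known_words=KNOWN_WORDS):
    """Return (ocr_word, suggestion) pairs for likely misread long s.

    Only one substitution per word is tried, so a word containing two
    misread long s characters is not caught by this sketch.
    """
    suggestions = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known_words:
            continue  # already a valid word, leave it alone
        # Long s is not used at the end of words, so skip the last letter.
        for i, ch in enumerate(lower[:-1]):
            if ch in "fl":
                candidate = lower[:i] + "s" + lower[i + 1:]
                if candidate in known_words:
                    suggestions.append((word, candidate))
                    break
    return suggestions
```

The result is a list of suggestions for a human to review rather than automatic corrections, since many words containing `f` or `l` are genuine.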

Describe the solution you'd like

I'd like Tesseract to be able to detect (based on appropriate training data, if needed) that it is dealing with different typographical conventions, such as those in pre-19th-century works, and to adjust its recognition accordingly. Long s should then be detected and emitted as the appropriate Unicode character. Where that is not possible (for example, when writing to a format that doesn't support Unicode), any detected long s should be converted internally to short s in the output.
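The fallback behaviour requested here can be sketched as a post-processing step on recognised text. This is only an illustration under stated assumptions, not how Tesseract behaves: the helper name `encode_ocr_text` is hypothetical, and it assumes the recogniser has already emitted long s as U+017F (LATIN SMALL LETTER LONG S).

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def encode_ocr_text(text, encoding="utf-8"):
    """Encode OCR text, keeping long s where the target encoding allows it.

    If the encoding cannot represent the text, long s is replaced with
    short s and encoding is retried. Other unencodable characters would
    still raise UnicodeEncodeError on the retry.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return text.replace(LONG_S, "s").encode(encoding)
```

With a Unicode-capable target such as UTF-8 the long s survives unchanged. With a target such as ASCII it is replaced by a short s.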

@Balearica
Member

This repository is for the WebAssembly port of Tesseract. We don't make any edits to the Tesseract recognition engine. Therefore, when you encounter a recognition issue, you should check whether the results are similar using the CLI version of Tesseract (with the same version and settings). If they are, the issue is outside the scope of this project and should be directed to the main Tesseract project.
