Is your feature request related to a problem? Please describe.
When performing OCR on older works (pre-19th century), certain typographical conventions that differ from modern printing will be encountered.
One of these is the use of a long s (ſ) within words, as opposed to the short (final or terminal) s used at the end of words.
Currently, Tesseract does not always detect these as reliably as some proprietary OCR APIs (such as Google's OCR service).
In my own experience of cleaning up OCR output from Tesseract, long s has been misrecognised as l or f, which is time-consuming to check manually in a lengthy run of text. (I appreciate that no OCR is ever going to recognise 'old' text perfectly.)
Describe the solution you'd like
The ability for Tesseract to detect (based on appropriate training data if needed) that it is dealing with different typographical conventions (such as those in pre-19th-century works) and to adjust its recognition accordingly, so that long s is detected and output as the appropriate Unicode character. If it is not possible to output long s (for example, when writing to a format that doesn't support Unicode), any long s detected should be converted internally to short s in the output.
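To illustrate the fallback I have in mind, here is a rough post-processing sketch in Python. It is not anything Tesseract does today; `fold_long_s` and `encode_for_output` are hypothetical helper names, and the sketch assumes the recogniser has already emitted long s as U+017F (ſ). If the target encoding cannot represent that character, it is folded to a plain s before encoding.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def fold_long_s(text: str) -> str:
    """Replace every long s with an ordinary short s (hypothetical helper)."""
    return text.replace(LONG_S, "s")


def encode_for_output(text: str, encoding: str = "utf-8") -> bytes:
    """Encode OCR text, folding long s only when the encoding cannot hold it."""
    try:
        # Probe whether the target encoding can represent U+017F at all.
        LONG_S.encode(encoding)
    except UnicodeEncodeError:
        text = fold_long_s(text)
    # Any other unencodable character is replaced rather than raising.
    return text.encode(encoding, errors="replace")
```

With this, `encode_for_output("Congreſs", "ascii")` gives `b"Congress"`, while UTF-8 output keeps the long s untouched.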
This repository is for the WebAssembly port of Tesseract. We don't make any edits to the Tesseract recognition engine. Therefore, when you encounter recognition issues, you should check whether the results are similar using the CLI version of Tesseract (with the same version and settings). If they are, the issue is outside the scope of this project and should be directed to the main Tesseract project.
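As a rough illustration of that check, the sketch below runs the Tesseract CLI from Python so its text can be compared with what the WebAssembly build returned. It assumes a `tesseract` binary is on the PATH; the helper names, the file name `page.png`, and the `eng` / `--psm 6` settings are placeholders to be matched to whatever the port was called with.

```python
import subprocess


def build_tesseract_cmd(image_path: str, lang: str = "eng", psm: int = 6) -> list[str]:
    """Build a CLI invocation that prints recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_cli_ocr(image_path: str, lang: str = "eng", psm: int = 6) -> str:
    """Run the CLI and return its text output (requires tesseract installed)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def same_text(cli_text: str, wasm_text: str) -> bool:
    """Compare two OCR results, ignoring differences in whitespace only."""
    return cli_text.split() == wasm_text.split()
```

For example, `same_text(run_cli_ocr("page.png"), wasm_output)` returning `True` would suggest the misrecognition comes from the engine or its trained data rather than from this port, where `wasm_output` stands for the text the port produced for the same image.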