Recognise Long-s as distinct from short-s. #455

afarlie · 2020-06-10T09:05:52Z

Is your feature request related to a problem? Please describe.

When performing OCR on older works (pre-19th century), certain typographical conventions that differ from modern printing will be encountered.

One of these is the use of the long s (ſ) within some words, as opposed to the short (final or terminal) s used at the end of words.

Currently, Tesseract does not detect these as reliably as some proprietary OCR APIs (such as Google's OCR service).

In my own experience of cleaning up OCR output from Tesseract, long s has been misrecognised as l or f, which is time-consuming to check manually in a lengthy run of text. (I appreciate that no OCR is ever going to recognise 'old' text perfectly.)

Describe the solution you'd like

The ability for Tesseract to detect (based on appropriate training data if needed) that it is dealing with different typographical conventions (such as those in pre-19th-century works) and to adjust its recognition accordingly, so that long s is detected and output as the appropriate Unicode character. If it is not possible to output long s (such as when writing to a format that doesn't support Unicode), any long s detected should be converted internally to short s in the output.
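To illustrate the fallback I have in mind, here is a minimal post-processing sketch in Python. It is not something Tesseract does today; the function name `fold_long_s` and the idea of keying the decision on the target encoding are my own assumptions for illustration. It keeps long s (U+017F) when the output encoding can represent it and replaces it with a plain s otherwise.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S (ſ)


def fold_long_s(text: str, encoding: str = "utf-8") -> str:
    """Replace long s with 's' only if `encoding` cannot represent it.

    Hypothetical helper for illustration; not part of Tesseract.
    """
    try:
        # Probe whether the target encoding has a code point for long s.
        LONG_S.encode(encoding)
    except UnicodeEncodeError:
        # The target cannot hold long s, so fall back to short s.
        return text.replace(LONG_S, "s")
    # The target supports long s, so leave the text untouched.
    return text
```

For example, `fold_long_s("Congreſs", "ascii")` gives `"Congress"`, while the default UTF-8 target leaves the word unchanged.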

Balearica · 2022-08-06T18:40:28Z

This repository is for the WebAssembly port of Tesseract. We don’t make any edits to the Tesseract recognition engine. Therefore, when you encounter recognition issues, you should check whether the results are similar using the CLI version of Tesseract (with the same version and settings). If they are, then the issue is outside the scope of this project and should be directed to the main Tesseract project.

Balearica closed this as completed Aug 6, 2022
