feature request: option to remove soft hyphens #2161

tukusejssirs · 2019-01-15T16:48:42Z

I know that OCR-ing printed hyphenation can be quite useful when one wants to create an exact copy of a physical document.

Otherwise, one would rather create a document in which a text processor handles the hyphenation. Or, as in my (current) case, I want to process a document and produce a plain .txt file (without any special/control characters).

The problem is that when I processed some scanned documents containing hyphenation using tesseract -l [lang], all the hyphens at the end of a line were replaced with soft hyphens (U+00AD). I didn’t realise this at first, because text editors display these characters as spaces, but when I ran the head command on these files, the output contained <U+00AD> characters.

I would like to have an option that prevents any special/control characters from being added. Even better: remove any hyphen at the end of a line and join that line with the following one, without a newline or space; however, this is not really part of this issue.
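A minimal sketch of the joining step described above, as a post-processing filter rather than a tesseract option. It assumes UTF-8 text in which every line-end hyphen has already been emitted as a soft hyphen (U+00AD, bytes 0xC2 0xAD); the function name join_hyphenated_lines is hypothetical.

```shell
# Hypothetical helper: reads text on stdin, writes the joined text to stdout.
join_hyphenated_lines() {
    # UTF-8 encoding of U+00AD, built from octal escapes (0xC2 0xAD)
    shy=$(printf '\302\255')
    # LC_ALL=C makes awk count bytes, so the comparison is locale-independent
    LC_ALL=C awk -v shy="$shy" '
        {
            n = length($0); k = length(shy)
            if (n >= k && substr($0, n - k + 1) == shy) {
                # Line ends with a soft hyphen: drop it and suppress the newline
                printf "%s", substr($0, 1, n - k)
            } else {
                print
            }
        }'
}
```

Soft hyphens that are not at the end of a line are left untouched by this filter.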

My system and versions:
Kubuntu 18.04.1 LTS AMD64
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX
Found SSE


tukusejssirs · 2019-01-15T17:24:18Z

I’ve just found this website which deals with so-called gremlins. It contains a workaround for this issue: you can replace any character using its Unicode code point. For example, I wanted to remove soft hyphens (U+00AD); I achieved it this way (note: the files have to be encoded in UTF-8, obviously):

# The space in "oča kávajú" is a soft hyphen
# `grep` displays a soft hyphen as a space, but when you select it in terminal, it will have a red background
$ grep -nP "\xad" advent\:_2._pondelok.txt 
3:Bratia a sestry, Kristus je radosťou všetkých, ktorí ho oča kávajú. S dôverou mu prednesme svoje prosby.

# This command removes all soft hyphens (U+00AD, encoded in UTF-8 as the bytes \xc2\xad)
$ sed -i 's/\xc2\xad//g' advent\:_2._pondelok.txt 

# And you can check that they were removed (no output means none are left)
$ grep -nP "\xad" advent\:_2._pondelok.txt
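The same clean-up can also be written as a filter that works on a pipe instead of editing files in place. This is only a sketch under assumptions: the function name strip_soft_hyphens is hypothetical, the input is assumed to be UTF-8, and the soft hyphen bytes are built with printf octal escapes so the sed call does not depend on \x escape support.

```shell
# Hypothetical helper: reads text on stdin, writes it to stdout without soft hyphens.
strip_soft_hyphens() {
    # UTF-8 encoding of U+00AD, built from octal escapes (0xC2 0xAD)
    shy=$(printf '\302\255')
    # LC_ALL=C makes sed treat the input as raw bytes
    LC_ALL=C sed "s/${shy}//g"
}
```

It could then be used as `tesseract scan.png stdout -l slk | strip_soft_hyphens > scan.txt`, where scan.png is a placeholder file name.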

zdenop · 2019-01-15T18:08:24Z

Did you see #728?

tukusejssirs · 2019-01-15T18:19:25Z

@zdenop, no, I didn’t find that issue, though I tried searching for it (probably I didn’t use the right keywords).

And yes, this issue is a duplicate of yours; however, I haven’t used hOCR, only tesseract -l slk in a Linux terminal (i.e. without any GUI). Therefore, I apologise and am closing this issue. Could you mark it as a duplicate, please? :)

tukusejssirs closed this as completed Jan 15, 2019

stweil added the duplicate label Jan 16, 2019

tukusejssirs mentioned this issue Jan 4, 2021

Soft hyphen HTML entity in hOCR manisandro/gImageReader#479

Open
