feature request: option to remove soft hyphens #2161

Closed
tukusejssirs opened this issue Jan 15, 2019 · 3 comments

I know that OCR-ing printed hyphenation can be quite useful when one wants to create an exact copy of a physical document.

In other cases, one would rather create a document in which a text processor handles the hyphenation. Or, as in my current case, I want to process a document and create a .txt file without any special/control characters.

The problem is that when I processed some scanned documents containing hyphenation with tesseract -l [lang], all the hyphens at the end of a line were replaced with soft hyphens (U+00AD). I didn’t notice this at first, because text editors display these characters as spaces, but when you view the files with the head command, they show up as <U+00AD>.

I would like to have an option that disables the output of any special/control characters. Even better: remove any hyphen at the end of a line and join that line with the following one, without a newline or space in between; however, that is not really part of this issue.
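
The line-joining step described above can be sketched as a post-processing filter. This is only an illustration, not an existing tesseract option: the function name join_soft_hyphens is made up, and it assumes UTF-8 text in which the soft hyphen is the byte pair C2 AD at the very end of a line.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): join every line that ends
# in a soft hyphen (U+00AD, UTF-8 bytes C2 AD) with the following line,
# dropping the soft hyphen. Reads stdin, writes stdout.
join_soft_hyphens() {
    # Build the two UTF-8 bytes with printf so that sed does not need
    # to understand \x escapes.
    shy=$(printf '\302\255')
    # LC_ALL=C makes sed treat the input as plain bytes.
    LC_ALL=C sed -e ':a' \
        -e "/${shy}\$/{" \
        -e '$!{' \
        -e 'N' \
        -e "s/${shy}\\n//" \
        -e 'ba' \
        -e '}' \
        -e '}'
}
```

It would be used as a filter, e.g. join_soft_hyphens < input.txt > output.txt (both file names are placeholders). A soft hyphen at the end of the very last line is left as-is, because there is no following line to join it with.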

My system and versions:
Kubuntu 18.04.1 LTS AMD64
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX
Found SSE

tukusejssirs commented Jan 15, 2019

I’ve just found this website which deals with so-called gremlins. It contains a workaround for this issue: you can match and replace any character by its Unicode code point. For example, I wanted to remove the soft hyphen (U+00AD); I achieved it this way (note: the files have to be encoded in UTF-8):

# The space in "oča kávajú" is a soft hyphen
# `grep` displays a soft hyphen as a space, but when you select it in terminal, it will have a red background
$ grep -nP "\xad" advent\:_2._pondelok.txt 
3:Bratia a sestry, Kristus je radosťou všetkých, ktorí ho oča­ kávajú. S dôverou mu prednesme svoje prosby.

# This command removes all soft hyphens (U+00AD, encoded in UTF-8 as the bytes \xc2\xad)
$ sed -i 's/\xc2\xad//g' advent\:_2._pondelok.txt

# And you can check that none are left (the command should print nothing)
$ grep -nP "\xad" advent\:_2._pondelok.txt
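
The same removal can also be written as a filter that leaves the original file untouched. This is a minimal sketch under assumptions: the function name strip_soft_hyphens is hypothetical, the text is UTF-8, and the soft-hyphen bytes are generated with printf so the command does not depend on sed understanding \x escapes (a GNU extension, as far as I know).

```shell
#!/bin/sh
# Hypothetical helper: delete every soft hyphen (U+00AD, UTF-8 bytes
# C2 AD) from stdin and write the result to stdout.
strip_soft_hyphens() {
    shy=$(printf '\302\255')   # the two UTF-8 bytes of U+00AD
    # LC_ALL=C makes sed work on raw bytes instead of characters.
    LC_ALL=C sed "s/${shy}//g"
}

# Possible use directly on the OCR output ("scan.png" is a placeholder;
# "stdout" as the output base makes tesseract print the text):
#   tesseract scan.png stdout -l slk | strip_soft_hyphens > scan.txt
```

Unlike sed -i, this keeps the unmodified OCR output around in case the soft hyphens are needed later.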

zdenop commented Jan 15, 2019

Did you see #728?

tukusejssirs commented

@zdenop, no, I didn’t find that issue, though I did try searching for it (I probably didn’t use the right keywords).

And yes, this issue is a duplicate of yours; however, I haven’t used hOCR, only tesseract -l slk in a Linux terminal (i.e. without any GUI). Therefore, I apologise and am closing this issue. Could you mark it as a duplicate, please? :)
