feature request: option to remove soft hyphens #2161
Comments
I’ve just found this website which deals with so-called gremlins. It contains a workaround for this issue: you can replace any character by its Unicode code point. For example, I wanted to remove a soft hyphen (U+00AD), and I did it this way (note: the files have to be encoded in UTF-8):
# The space in "oča kávajú" is a soft hyphen
# `grep` displays a soft hyphen as a space, but when you select it in terminal, it will have a red background
$ grep -nP "\xad" advent\:_2._pondelok.txt
3:Bratia a sestry, Kristus je radosťou všetkých, ktorí ho oča kávajú. S dôverou mu prednesme svoje prosby.
# This command removes all soft hyphens (U+00AD, encoded in UTF-8 as the bytes \xc2\xad)
$ sed -i 's/\xc2\xad//g' advent\:_2._pondelok.txt
# And you can check if it was replaced
$ grep -nP "\xad" advent\:_2._pondelok.txt
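The same workaround can be sketched without relying on `\x` escapes in `grep -P` or `sed`, which not every implementation supports. This is only a sketch under stated assumptions: the file names `sample.txt` and `cleaned.txt` are hypothetical, the input is UTF-8, and the matching is done byte-wise in the C locale.

```shell
# U+00AD is encoded in UTF-8 as the two bytes 0xC2 0xAD (octal 302 255).
shy=$(printf '\302\255')

# Hypothetical sample line: "oča", a soft hyphen, then "kávajú".
printf 'o\304\215a\302\255k\303\241vaj\303\272\n' > sample.txt

# Count the lines that contain a soft hyphen.
# grep -c exits with status 1 when the count is 0, hence "|| true".
before=$(LC_ALL=C grep -c "$shy" sample.txt || true)

# Delete every soft hyphen and write the result to a new file,
# leaving the original untouched (unlike sed -i).
LC_ALL=C sed "s/$shy//g" sample.txt > cleaned.txt

after=$(LC_ALL=C grep -c "$shy" cleaned.txt || true)
echo "lines with soft hyphens: before=$before after=$after"
```

For the sample above this should report one matching line before the cleanup and none after it. Writing to a separate file instead of using `sed -i` keeps the OCR output available in case the result needs to be compared.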
Did you see #728?
@zdenop, no, I didn’t find that issue, though I tried searching for it (I probably didn’t use the right keywords). And yes, this issue is a duplicate of yours; however, I haven’t used
I know that OCR-ing printed hyphenation can (or might?) be quite useful when one wants to create an exact copy of a physical document.
Otherwise, one would rather create a document in which a text processor deals with the hyphenation. Or, as in my (current) case, I want to process a document and create a .txt file (without any special/control characters).
The problem is that when I processed some scanned documents with hyphenation using `tesseract -l [lang]`, all the hyphens at the end of a line were replaced with soft hyphens (U+00AD). I didn’t realise this, because text editors display these characters as spaces, but when I used the `head` command, the files contained `<U+00AD>` characters.
I would like to have an option that disables the output of any special/control characters. Even better: remove any hyphens at the end of a line and join the line with the following one, without a newline or space; however, this is not really part of this issue.
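Until such an option exists, the joining step could be approximated as a post-processing pass over the text output. The following is a minimal sketch, not a tesseract feature: it assumes the output is UTF-8, that line-end hyphens were emitted as soft hyphens as described above, and it uses the hypothetical file names `ocr.txt` and `joined.txt`. Ordinary ASCII hyphens at a line end are deliberately left alone, since they may belong to compound words.

```shell
# U+00AD is encoded in UTF-8 as the two bytes 0xC2 0xAD (octal 302 255).
shy=$(printf '\302\255')

# Hypothetical OCR output: "oča" ends a line with a soft hyphen,
# and "kávajú." continues on the next line.
printf 'o\304\215a\302\255\nk\303\241vaj\303\272.\n' > ocr.txt

# Step 1 (awk): if a line ends with a soft hyphen, strip it and print the
# line without a newline, so the next line is appended directly to it.
# Step 2 (sed): delete any soft hyphens that remain inside lines.
LC_ALL=C awk -v shy="$shy" '
{
    if (sub(shy "$", ""))
        printf "%s", $0
    else
        print
}' ocr.txt | LC_ALL=C sed "s/$shy//g" > joined.txt

cat joined.txt
```

Both tools run in the C locale so that the soft hyphen is matched as a plain byte sequence regardless of the user's locale settings. If the very last line of the input ends with a soft hyphen, the output will lack a trailing newline.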
My system and versions:
Kubuntu 18.04.1 LTS AMD64
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE