The hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy; into a box in the hOCR within gImageReader’s output pane, the source is changed to &amp;shy;, which is not correct.
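The symptom can be reproduced in isolation with Python's standard `html` module. This is only an illustration of double escaping, assuming the editor treats the typed text as literal characters and escapes the ampersand on save; whether gImageReader does exactly this is not confirmed here.

```python
import html

typed = "&shy;"  # what the user types into the box

# Escaping the typed text as literal characters turns the ampersand into &amp;
stored = html.escape(typed)

# A reader of the hOCR file then decodes one level and sees the literal
# string "&shy;" instead of the soft hyphen character U+00AD.
as_rendered = html.unescape(stored)

# For comparison, this is what decoding the intended entity gives.
intended = html.unescape(typed)
```

With this input, `stored` is `&amp;shy;` and `as_rendered` is the five visible characters `&shy;`, whereas `intended` is the single soft hyphen character.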
IMHO the (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (esp. this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but I no longer do.
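The requested behaviour could look roughly like the sketch below. The function names are hypothetical and not part of gImageReader or tesseract; the sketch assumes each recognized line arrives as a string that ends in U+00AD when the last word is hyphenated.

```python
SOFT_HYPHEN = "\u00ad"  # the character tesseract is said to emit at line ends


def line_for_hocr(line: str) -> str:
    """Return the line for hOCR output, with a trailing soft hyphen as &shy;."""
    if line.endswith(SOFT_HYPHEN):
        return line[: -len(SOFT_HYPHEN)] + "&shy;"
    return line


def line_for_text(line: str) -> str:
    """Return the line for plain-text output, with a trailing soft hyphen as '-'."""
    if line.endswith(SOFT_HYPHEN):
        return line[: -len(SOFT_HYPHEN)] + "-"
    return line
```

Only a soft hyphen at the very end of a line is touched, so hyphens inside a line are left as recognized.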
For now, I need to replace the hyphens at the end of the lines with &shy; using a text editor or sed:
sed -zi 's|-</span>\(\n\s*</span>\)|\&shy;\1|g' <filename>
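The substitution that the sed command above is meant to perform can also be sketched in Python. It assumes the same hOCR layout as the sed pattern: the last word's closing `</span>`, a newline, optional indentation, then the line's closing `</span>`. The function name is hypothetical.

```python
import re

# A hyphen right before the last word's </span>, followed by the newline,
# optional indentation, and the enclosing line's </span>.
_LINE_END_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_line_end_hyphens(hocr: str) -> str:
    """Replace a hyphen ending the last word of a line with &shy;."""
    # \1 puts back the matched closing tags and whitespace unchanged.
    return _LINE_END_HYPHEN.sub(r"&shy;\1", hocr)


sample = (
    "<span class='ocr_line'>\n"
    "  <span class='ocrx_word'>well-known</span>\n"
    "  <span class='ocrx_word'>exam-</span>\n"
    "</span>\n"
)
converted = soften_line_end_hyphens(sample)
```

Capturing the closing tags and reinserting them keeps the original indentation, and the hyphen inside `well-known` is not matched because it is not directly followed by `</span>`.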
Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). So there is some other issue too.
Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is not very good or stable in my location.
Related: #94