Soft hyphen HTML entity in hOCR #479

tukusejssirs · 2021-01-04T11:47:23Z

Related: #94

The hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy; in a box in hOCR within gImageReader’s output pane, the source is changed to &amp;shy;, which is not correct.

IMHO the (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (esp. this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but I no longer do.
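To make the request above concrete, here is a minimal Python sketch of the conversion I have in mind. It is only an illustration: the function names are hypothetical, and this is not code from gImageReader or tesseract. It assumes each recognised line arrives as a string that may end in U+00AD.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the character tesseract is assumed to emit


def line_end_for_hocr(line: str) -> str:
    """Write a trailing soft hyphen as the &shy; entity (hOCR output)."""
    if line.endswith(SOFT_HYPHEN):
        return line[:-1] + "&shy;"
    return line


def line_end_for_text(line: str) -> str:
    """Write a trailing soft hyphen as a regular hyphen (plain-text output)."""
    if line.endswith(SOFT_HYPHEN):
        return line[:-1] + "-"
    return line
```

Lines that do not end in a soft hyphen are returned unchanged, so hyphens inside words are left alone.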

For now, I need to replace the hyphens at the end of the lines with &shy; using a text editor or sed:

text editor (using regex):

find: -(?=</span>\n\s*</span>)
replace: &shy;

sed:

sed -zi 's|-\(</span>\n\s*</span>\)|\&shy;\1|g' <filename>
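For anyone without GNU sed (the -z option and \s are GNU extensions), the same replacement can be sketched in Python. This is a rough equivalent, not a tested tool: the function name is hypothetical, and it assumes the hOCR layout shown above, where a word's closing </span> is followed by a newline, optional indentation, and the line's closing </span>.

```python
import re

# A hyphen directly before a closing word span that is the last child of
# its line span. The closing tags and the whitespace between them are
# captured so they can be written back unchanged.
TRAILING_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_line_end_hyphens(hocr: str) -> str:
    """Replace line-final hyphens with the &shy; entity."""
    return TRAILING_HYPHEN.sub(r"&shy;\1", hocr)
```

Usage would be reading the hOCR file into a string, passing it through soften_line_end_hyphens, and writing the result back. Capturing the closing tags matters because a literal \s* in the replacement text would be written into the file as is.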

Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). Therefore there is some other problem too.

Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is not very fast or stable at my location.


tukusejssirs mentioned this issue Jan 4, 2021

Suggestions to optimise and improve gImageReader #480

Closed

tukusejssirs mentioned this issue Jan 12, 2021

Corrupted PDF file export with hOCR data #486

Closed
