Soft hyphen HTML entity in hOCR #479

Open
tukusejssirs opened this issue Jan 4, 2021 · 0 comments

tukusejssirs commented Jan 4, 2021

Related: #94


The hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy; in a box in hOCR within gImageReader's output pane, the stored source is changed to something other than &shy;, which is not correct.

IMHO the (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (especially this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but I no longer do.
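To illustrate the requested behaviour, here is a minimal Python sketch. It is not gImageReader code; the function names `to_hocr` and `to_plain_text` are hypothetical, and it assumes the OCR engine marks a line-end hyphen with a trailing U+00AD on the word.

```python
import html

SOFT_HYPHEN = "\u00ad"  # U+00AD, assumed to mark a line-end hyphen


def to_hocr(word: str) -> str:
    """Escape the word for hOCR and write a trailing soft hyphen as &shy;."""
    if word.endswith(SOFT_HYPHEN):
        return html.escape(word[:-1]) + "&shy;"
    return html.escape(word)


def to_plain_text(word: str) -> str:
    """Write a trailing soft hyphen as a regular hyphen for text output."""
    if word.endswith(SOFT_HYPHEN):
        return word[:-1] + "-"
    return word
```

Words without a trailing soft hyphen pass through unchanged, apart from the usual HTML escaping on the hOCR side.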

For now, I need to replace the hyphens at the end of the lines with &shy; using a text editor or sed:

  1. text editor (using regex; the group reference may be \1 or $1 depending on the editor):
find: -(</span>\n\s*</span>)
replace: &shy;\1
  2. sed (GNU sed; -z lets the pattern match across newlines):
sed -zi 's|-\(</span>\n\s*</span>\)|\&shy;\1|g' <filename>
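The same substitution can be sketched in Python for anyone without GNU sed. This is only an illustration under the same assumption as the commands above, namely that a line-end hyphen is immediately followed by the closing word span, a newline, optional whitespace, and the closing line span. The function name `soften_line_end_hyphens` is made up for this example.

```python
import re

# A hyphen right before the closing word span that ends an hOCR line.
# The closing tags are captured so they can be written back unchanged.
LINE_END_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_line_end_hyphens(hocr: str) -> str:
    """Replace line-end hyphens with the &shy; entity, keeping the tags."""
    return LINE_END_HYPHEN.sub(r"&shy;\1", hocr)
```

Hyphens inside a word or in the middle of a line are not followed by that tag sequence, so they are left alone.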

Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). So there is some other bug as well.
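This is not a diagnosis of the crash, only a check that may help narrow it down: a strict XML parser rejects &shy; unless a DTD declares it, while the numeric reference &#173; is always accepted. The sketch below uses Python's standard library parser, not whatever gImageReader uses internally, and the helper name `parses_as_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is accepted by a strict XML parser."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        # Raised for undefined entities such as &shy; without a DTD.
        return False
    return True
```

If the edited file fails a check like this, trying &#173; instead of &shy; would show whether the entity name itself is involved.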

Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is not very stable at my location.
