What about hyphens? #94

CharlesNepote · 2016-09-06T23:30:40Z

This issue might not be related to gImageReader. Is there some kind of hyphen management in Tesseract/gImageReader?
In plain-text mode, hyphens are shown when they shouldn't be: "solu-tion" is produced instead of "solution".
The same happens in hOCR mode: searching for "solution" in the hOCR .html file or the .pdf file finds no match.

The hOCR standard seems to be aware of hyphens and says: "Soft hyphens must be represented using the HTML &shy; entity."
I understand it can be quite difficult to manage both topological data (boxes, lines) and text flow at the same time in the XML tree of the hOCR file...
I don't have any idea how to deal with that.

manisandro · 2016-09-07T07:22:40Z

Why do you think hyphens shouldn't be shown? Tesseract returns the recognized text in the same format as the original text. So if the original text has hyphenated words, so will the recognition result. In plain-text mode there is a post-processing function to join the hyphenated words. In hOCR mode, I'm not sure what you'd expect the result to be. At least in PDF overlay mode, I suppose it's correct to expect that the invisible text matches the text in the image, i.e. including hyphenations. Am I misunderstanding something?
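
For illustration, joining hyphenated words in plain text could look roughly like the sketch below. This is not gImageReader's actual post-processing code; the function name `join_hyphenated` is hypothetical, and the sketch assumes the word was split with an ASCII hyphen directly before a line break.

```python
import re

# Hypothetical helper, not gImageReader's implementation.
# Matches: word character, ASCII hyphen, newline, word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n(\w)")


def join_hyphenated(text: str) -> str:
    """Join words split by a hyphen at a line break ('solu-\\ntion' -> 'solution')."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


if __name__ == "__main__":
    print(join_hyphenated("There is a solu-\ntion to this."))
```

A rule this simple also merges genuine compounds that happen to break at their hyphen, so a real implementation would likely need a dictionary check or user confirmation.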

manisandro · 2016-09-09T21:23:07Z

Is this clarified?

CharlesNepote · 2016-09-10T15:34:37Z

In plain-text mode I finally discovered the post-processing function to join hyphenated words.

In hOCR mode, I understand that OCR produces the same format as the original text, but a hyphenated word is not searchable as a normal word, and I think that's a problem. It makes no sense to search for "solu-tion". I don't know if there is a solution to that.

manisandro · 2016-09-10T16:43:52Z

I don't see any solution. If you merge the words, the text flow of the text overlay won't match the one of the image, and in the worst case the joined word will even overflow the page area. If you have a particular suggestion, let me know. In the meantime I'm closing this since I don't know anything that can be done about it.

Golddouble · 2016-09-10T18:34:22Z

Please allow me to add a comment here, although the issue is already closed.

This is an interesting issue: searching with "CTRL + F" for a word that is split, for example "solu-tion" or the German word "in-direkt", in a PDF with hidden text.

In a PDF with invisible text that I generated with gImageReader, some split words are not searchable and some are. You can download it here:

http://docdro.id/Q6phyat

Search for the word "aufgeteilt" -> you will find the split word "auf-geteilt".
When you select the part "auf-" by double-clicking, the whole word is selected.
When you select the part "auf-" by double-clicking and copy it to the clipboard, the clipboard contains the unsplit word "aufgeteilt".

Are you aware of this?

manisandro · 2016-09-10T18:38:51Z

Thanks for this info. My PDF viewer does not do this, so I suppose it's a feature of some PDF viewers. Whether a word is searchable or not may depend on which character is used as the hyphen (i.e. the dash vs. the minus char)?

Golddouble · 2016-09-10T18:49:37Z

I have checked this. And yes, this is exactly the point: the feature only works when the splitting "-" has been recognized. (I used the Acrobat Adobe Reader XI)

manisandro · 2016-09-10T18:57:48Z

I suppose I could have the program automatically fix the hyphen char if it finds one in the last word block of a text line. That should be pretty safe, since I don't assume there are many cases where a line ought to end with the minus char.
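
A minimal sketch of that idea, not the code that later went into gImageReader. It assumes a text line is available as a list of word strings, that the set of dash-like characters below covers what the OCR engine returns, and that the ASCII hyphen-minus "-" (U+002D) is the character PDF viewers recognize, as the later report in this thread suggests. The names `DASH_LIKE` and `fix_line_end_hyphen` are hypothetical.

```python
from __future__ import annotations

# Assumed set of dash-like characters OCR may return instead of "-".
DASH_LIKE = {
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2014",  # EM DASH
    "\u2212",  # MINUS SIGN
}


def fix_line_end_hyphen(words: list[str]) -> list[str]:
    """Return a copy of the line's words where a dash-like character
    ending the last word is replaced by the ASCII hyphen-minus."""
    fixed = list(words)
    if not fixed:
        return fixed
    last = fixed[-1]
    # Only touch a dash attached to a word part; leave a lone dash alone.
    if len(last) > 1 and last[-1] in DASH_LIKE:
        fixed[-1] = last[:-1] + "-"
    return fixed


if __name__ == "__main__":
    print(fix_line_end_hyphen(["der", "Mecha\u2014"]))
```

The word stays hyphenated and the text flow is unchanged; only the code point of the trailing character differs.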

Golddouble · 2016-09-10T19:09:37Z

Yes, I agree. You already implemented this in the gImageReader preview version, in the plain-text refinement functions: "put together words with hyphen".

manisandro · 2016-09-10T19:11:59Z

No, that is something different: there I actually merge the two word parts. Here I'd just ensure the correct hyphen char is used, but the word will remain hyphenated.

Golddouble · 2016-09-10T19:20:48Z

Ah, yes I see.

manisandro · 2016-09-10T20:42:52Z

I've added some code for this in 6b074d9, please check whether it works for your document. If you still find some incorrect hyphen chars, please indicate which ones they are. Updated Windows test builds are here:

Golddouble · 2016-09-11T08:53:54Z

Thank you.

Yes, there are still some words that do not support this PDF reader feature.

A) From a German original image, I made a PDF with hidden text using English (United States). Then I also made a "plain text" output with the same settings.

B) From a German original image, I made a PDF with hidden text using Deutsch (de). Then I also made a "plain text" output with the same settings.

I have saved the "plain text" output. You can download it here: http://docdro.id/qtIfcNQ
The original document: http://docdro.id/Q6phyat

In A) the following words do not support the PDF-Reader Feature:
Ni-veau, --------- Ein-fluss, auseinander-brach, Mecha-nismus, --------------, Ge-schäft, auf-geteilt, un-bekannt, sei-nem, -------------, Me-chanismus

In B) the following words do not support the PDF-Reader Feature:
Ni-veau, Na-hen, Ein-fluss, auseinander-brach, Mecha-nismus, Kapitalismus, Ge-schäft, ------------, un-bekannt, ----------, Ptole-mäus, Me-chanismus

Remark: It seems that the automatic fix of the hyphen chars does not work at all. All words that are divided by "—" still do not support the PDF reader feature, while the words divided by "-" do.
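
To pin down which characters are involved, the line-ending characters of the plain-text output can be listed by code point. The sketch below is only a diagnostic aid under the assumption that the text is available as a Python string; the function name `line_end_chars` is hypothetical.

```python
import unicodedata
from collections import Counter


def line_end_chars(text: str) -> Counter:
    """Count the non-alphanumeric characters that end each non-empty line,
    keyed by code point and Unicode name."""
    counts: Counter = Counter()
    for line in text.splitlines():
        line = line.rstrip()
        if line and not line[-1].isalnum():
            ch = line[-1]
            name = unicodedata.name(ch, "UNKNOWN")
            counts[f"U+{ord(ch):04X} {name}"] += 1
    return counts


if __name__ == "__main__":
    sample = "Mecha\u2014\nnismus und auf-\ngeteilt."
    for key, n in line_end_chars(sample).items():
        print(n, key)
```

Any entry other than `U+002D HYPHEN-MINUS` at the end of a hyphenated line is a candidate for the incorrect hyphen chars described above.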

manisandro · 2016-09-11T09:10:36Z

Please attach the hocr document containing the incorrect hyphens.

Golddouble · 2016-09-11T10:12:04Z

B) http://docdro.id/zPd7YON

manisandro · 2016-09-11T17:23:48Z

Fixed in af8bc6, part of latest preview build.

Golddouble · 2016-09-13T19:02:16Z

Thank you. It works with the latest preview.

manisandro closed this as completed Sep 10, 2016

manisandro reopened this Sep 11, 2016

manisandro closed this as completed Sep 11, 2016

tukusejssirs mentioned this issue Jan 4, 2021

Soft hyphen HTML entity in hOCR #479

Open
