What about hyphens? #94

Closed
CharlesNepote opened this issue Sep 6, 2016 · 17 comments

Comments

@CharlesNepote
Contributor

This issue might not be related to gImageReader. Is there some kind of hyphen management in Tesseract/gImageReader?
In plain-text mode, hyphens are kept although they shouldn't be: "solu-tion" is produced instead of "solution".
The same happens in hOCR mode. Searching for "solution" in the hOCR .html file or the .pdf file won't match.

The hOCR standard seems to be aware of hyphens and says: "Soft hyphens must be represented using the HTML &shy; entity."
I understand it can be quite difficult to manage both topological data (boxes, lines) and text flow at the same time in the XML tree of the hOCR file...
I don't have any idea how to deal with that.

@manisandro
Owner

Why do you think hyphens shouldn't be shown? Tesseract returns the recognized text in the same format as the original text. So if the original text has hyphenated words, so will the recognition result. In plain-text mode there is a post-processing function to join the hyphenated words. In hOCR mode, I'm not sure what you'd expect the result to be. At least in PDF overlay mode, I suppose it's correct to expect that the invisible text matches the text in the image, i.e. including hyphenations. Am I misunderstanding something?
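
For illustration, a plain-text post-processing step of the kind mentioned above could look roughly like the following. This is a minimal sketch, not gImageReader's actual code: the function name `join_hyphenated` and the set of characters treated as hyphens are assumptions made for the example.

```python
import re

# Hypothetical helper, not taken from gImageReader.
# Matches: a word character, a hyphen-like character (ASCII hyphen-minus,
# U+2010 HYPHEN or U+00AD SOFT HYPHEN; an assumed set), a line break,
# optional indentation, and the next word character.
_HYPHEN_BREAK = re.compile(r"(\w)[-\u2010\u00AD]\n[ \t]*(\w)")


def join_hyphenated(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "There is no simple solu-\ntion to this."
    print(join_hyphenated(sample))
```

Note that a purely textual rule like this also joins genuine compounds that happen to break at their hyphen (for example "well-" / "known"), so it is only suitable for plain-text output where the line layout no longer matters.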

@manisandro
Owner

Is this clarified?

@CharlesNepote
Contributor Author

In plain-text mode I finally discovered the post-processing function to join hyphenated words.

In hOCR mode, I understand that OCR produces the same format as the original text, but a hyphenated word is not searchable as a normal word, and I think that's a problem. It makes no sense to search for "solu-tion". I don't know if there is a solution to that.

@manisandro
Owner

I don't see any solution. If you merge the words, the text flow of the text overlay won't match that of the image, and in the worst case the joined word will even overflow the page area. If you have a particular suggestion, let me know. In the meantime I'm closing this since I don't know of anything that can be done about it.

@Golddouble

Please allow me to add a comment here although the issue is already closed.

This is an interesting issue: searching with "CTRL + F" for a word that is split, for example "solu-tion" or the German word "in-direkt", in a PDF with hidden text.

In a PDF with invisible text that I generated with gImageReader, some split words are searchable and some are not. You can download it here:

http://docdro.id/Q6phyat

Search for the word "aufgeteilt" -> you will find the split word "auf-geteilt".
When you select the part "auf-" by double-clicking, the whole word is selected.
When you select the part "auf-" by double-clicking and copy it to the clipboard, the clipboard contains the unsplit word "aufgeteilt".

Are you aware of this?

@manisandro
Owner

Thanks for this info. My PDF viewer does not do this, so I suppose it's a feature of some PDF viewers. That some are searchable and some not may depend on which character is used as hyphen (i.e. the dash vs the minus char)?

@Golddouble

Golddouble commented Sep 10, 2016

I have checked this. And yes, this is exactly the point: the feature only works when the splitting "-" has been recognized. (I used Adobe Reader XI.)

@manisandro
Owner

I suppose I could have the program automatically fix the hyphen char if it finds one in the last word block of a text line. That should be pretty safe, since I don't assume there are many cases where a line ought to end with the minus char.
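
A rough sketch of that idea, assuming the words of one text line are available as a list of strings. The function name `fix_line_end_hyphen` and the list of dash-like characters are hypothetical and chosen for illustration; this is not the code that was later committed. The ASCII "-" is used as the target because that is the character the PDF viewer in this thread handled as a word split.

```python
# Dash-like characters an OCR engine may emit where a hyphen was printed.
# This list is an assumption for the example, not an exhaustive set.
DASH_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2212\u00AD"


def fix_line_end_hyphen(words: list[str]) -> list[str]:
    """Return a copy of one text line's words in which a dash-like character
    ending the last word is replaced by the ASCII hyphen-minus '-'.

    Only the final character of the final word is considered, and a word
    consisting of a single dash is left alone, since it is more likely a
    real dash than a hyphenation mark.
    """
    fixed = list(words)
    if fixed:
        last = fixed[-1]
        if len(last) > 1 and last[-1] in DASH_LIKE:
            fixed[-1] = last[:-1] + "-"
    return fixed


if __name__ == "__main__":
    line = ["der", "Mecha\u2014"]  # last word ends with an em dash
    print(fix_line_end_hyphen(line))
```

The words stay hyphenated and keep their positions, so the invisible text still follows the layout of the image; only the character used for the hyphen changes.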

@Golddouble

Golddouble commented Sep 10, 2016

Yes, I agree. You already implemented this in the gImageReader preview version, in the plain-text refinement functions: "put together words with hyphen".

@manisandro
Owner

No that is something different, there I actually merge together the two word parts. Here I'd just ensure the correct hyphen char is used, but the word will remain hyphenated.

@Golddouble

Ah, yes I see.

@manisandro
Owner

I've added some code for this in 6b074d9, please check whether it works for your document. If you still find some incorrect hyphen chars, please indicate which ones. Updated Windows test builds are here:

@Golddouble

Thank you.

Yes, there are still some words that do not support this PDF reader feature.

A) From a German original picture I made a PDF with hidden text, using English (United States). Then I also made a "plain text" with the same settings.

B) From a German original picture I made a PDF with hidden text, using Deutsch (de). Then I also made a "plain text" with the same settings.

I have saved the "plain text". You can download it here: http://docdro.id/qtIfcNQ
The original document: http://docdro.id/Q6phyat

In A) the following words do not support the PDF-Reader Feature:
Ni-veau, --------- Ein-fluss, auseinander-brach, Mecha-nismus, --------------, Ge-schäft, auf-geteilt, un-bekannt, sei-nem, -------------, Me-chanismus

In B) the following words do not support the PDF-Reader Feature:
Ni-veau, Na-hen, Ein-fluss, auseinander-brach, Mecha-nismus, Kapitalismus, Ge-schäft, ------------, un-bekannt, ----------, Ptole-mäus, Me-chanismus

Remark: It seems that the automatic fix of the hyphen chars does not work at all. All words that are divided by "—" still do not support the PDF reader feature, while the words divided by "-" do.

@manisandro
Owner

Please attach the hocr document containing the incorrect hyphens.

@manisandro manisandro reopened this Sep 11, 2016
@Golddouble

B) http://docdro.id/zPd7YON

@manisandro
Owner

Fixed in af8bc6, part of latest preview build.

@Golddouble

Thank you. Works with the latest preview.
