What about hyphens? #94
Comments
Why do you think hyphens shouldn't be shown? Tesseract returns the recognized text in the same format as the original text. So if the original text has hyphenated words, so will the recognition result. In plain-text mode there is a post-processing function to join the hyphenated words. In hOCR mode, I'm not sure what you'd expect the result to be. At least in PDF overlay mode, I suppose it's correct to expect that the invisible text matches the text in the image, i.e. including hyphenations. Am I misunderstanding something?
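For illustration, the kind of plain-text post-processing mentioned above could look like the following minimal Python sketch. It is not gImageReader's actual code: the name `join_hyphenated` and the regular expression are hypothetical, and the sketch assumes the split is marked by an ASCII hyphen-minus directly before a line break.

```python
import re

# Hypothetical pattern: a word character, a hyphen-minus, a line break,
# optional indentation, then another word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n\s*(\w)")


def join_hyphenated(text: str) -> str:
    """Merge words that were split by a hyphen at a line break."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


print(join_hyphenated("There is no solu-\ntion to that."))
```

A sketch like this cannot tell a line-break hyphen from a genuine compound hyphen that happens to fall at the end of a line, so it would also turn "well-" + "known" into "wellknown".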
Is this clarified?
In plain-text mode I finally discovered the post-processing function to join hyphenated words. In hOCR mode, I understand that OCR produces the same format as the original text, but a hyphenated word is not searchable as a normal word, and I think that's a problem. It makes no sense to search for "solu-tion". I don't know if there is a solution to that.
I don't see any solution. If you merge the words, the text flow of the text overlay won't match that of the image, and in the worst case the joined word will even overflow the page area. If you have a particular suggestion, let me know. In the meantime I'm closing this, since I don't know of anything that can be done about it.
Please allow me to add a comment here, although the issue is already closed. This is an interesting issue: searching with Ctrl+F for a word that is split, for example "solu-tion" or the German word "in-direkt", in a PDF with hidden text. In a PDF with invisible text that I generated with gImageReader, some split words are searchable and some are not. You can download it here: Search for the word "aufgeteilt" and you will find the split word "auf-geteilt". Are you aware of this?
Thanks for this info. My PDF viewer does not do this, so I suppose it's a feature of some PDF viewers. Whether a split word is searchable may depend on which character is used as the hyphen (i.e. the dash vs. the minus char)?
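One way to check which character the OCR output actually contains is to list the code point and Unicode name of every dash-like character. The sketch below uses Python's standard `unicodedata` module. The helper name `describe_dashes` is hypothetical, and the choice of what counts as "dash-like" (Unicode category Pd plus the minus sign and the soft hyphen) is an assumption.

```python
import unicodedata

# Assumption: besides "dash punctuation" (category Pd), also report the
# minus sign (U+2212) and the soft hyphen (U+00AD), which are not in Pd.
EXTRA_DASH_LIKE = {"\u2212", "\u00ad"}


def describe_dashes(text: str) -> list[tuple[str, str, str]]:
    """Return (char, code point, Unicode name) for each dash-like character."""
    found = []
    for ch in text:
        if unicodedata.category(ch) == "Pd" or ch in EXTRA_DASH_LIKE:
            found.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return found


for entry in describe_dashes("auf-geteilt, solu\u2014tion"):
    print(entry)
```

Running this over the text of the words that are not found by the viewer's search should show whether they use a different character than the ones that are found.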
I have checked this. And yes, this is exactly the point: this feature only works when the splitting "-" has been recognized. (I used Adobe Acrobat Reader XI.)
I suppose I could have the program automatically fix the hyphen chars if it finds one in the last word block of a text line. That should be pretty safe, since I don't assume there are many cases where a line ought to end with the minus char.
Yes, I agree. You already implemented this in the gImageReader preview version, in the plain-text refinement functions: "put together words with hyphen".
No, that is something different: there I actually merge the two word parts together. Here I'd just ensure the correct hyphen char is used, but the word will remain hyphenated.
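The idea described here, replacing a wrong dash character at the end of a line's last word while leaving the word hyphenated, might be sketched as follows. This is only an illustration in Python, not the code from the actual commit: the function name `fix_line_end_hyphens` and the set of characters treated as misrecognized hyphens are assumptions.

```python
# Assumed set of characters that OCR may produce instead of a hyphen:
# en dash, em dash, minus sign.
DASH_LIKE = "\u2013\u2014\u2212"


def fix_line_end_hyphens(lines: list[str]) -> list[str]:
    """Replace a dash-like character ending a line's last word with '-'."""
    fixed = []
    for line in lines:
        body = line.rstrip()
        trailing = line[len(body):]  # keep any trailing whitespace as-is
        # Only touch the dash if it is directly attached to a letter, so a
        # free-standing dash at the end of a line is left alone.
        if len(body) >= 2 and body[-1] in DASH_LIKE and body[-2].isalpha():
            body = body[:-1] + "-"
        fixed.append(body + trailing)
    return fixed


print(fix_line_end_hyphens(["Der Text ist auf\u2014", "geteilt."]))
```

The words are not merged, so the text flow of the overlay still matches the image; only the character used for the hyphen changes.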
Ah, yes, I see.
I've added some code for this in 6b074d9, please check whether it works for your document. If you still find some incorrect hyphen chars, please indicate which ones. Updated Windows test builds are here:
Thank you. Yes, there are still some words that do not support this PDF reader feature.
A) From a German original picture, I made a PDF with hidden text using English (United States). Then I also made a "plain text" with the same settings.
B) From a German original picture, I made a PDF with hidden text using Deutsch (de). Then I also made a "plain text" with the same settings.
I have saved the "plain text". You can download it here: http://docdro.id/qtIfcNQ
In A) the following words do not support the PDF reader feature:
In B) the following words do not support the PDF reader feature:
Remark: It seems that the automatic fix of the hyphen chars does not work at all. All words that are divided by "—" still do not support the PDF reader feature, while the words divided by "-" do.
Please attach the hOCR document containing the incorrect hyphens.
Fixed in af8bc6, part of the latest preview build.
Thank you. It works with the latest preview.
This issue might not be related to gImageReader. Is there some kind of hyphen management in Tesseract/gImageReader?
In plain-text mode, hyphens are shown while they shouldn't be: "solu-tion" is produced instead of "solution".
The same happens in hOCR mode: searching for "solution" in the hOCR .html file or the .pdf file won't match.
The hOCR standard seems to be aware of hyphens and says: "Soft hyphens must be represented using the HTML &shy; entity."
I understand it can be quite difficult to manage both topological data (boxes, lines) and text flow at the same time in the XML tree of the hOCR file...
I don't have any idea how to deal with that.