[Feature]: Choose between NFKC and NFC normalization for Unicode characters so copy-pasting works #1282
Comments
I am happy to submit a patch, if you accept contributions. I would suggest having a command-line option like I am also open to other command-line flags, if you think that users learning about Unicode normalization is too much to impose on them.
@sfllaw I appreciate the suggestion, but I think what will really be needed is to insert markup into the PDF that allows competent PDF readers to see what is going on - and then to test whether it helps sufficiently. If you want to attempt it, the relevant portion of the spec is below:
@jbarlow83 If I understand correctly, you are suggesting that we use That is, for the above example with the scanned
Maybe I am misunderstanding your proposal, because it seems like this will depend on how the PDF reader deals with Because of this, I can’t think of an advantage over skipping NFKC altogether and rendering the NFC version in GlyphLessFont. Does Also, it looks like some PDF readers don’t handle non-trivial
A few key points here:
When using parentheses in a content stream, the character IDs must be encoded in pdfdoc. However ½ is U+00BD which is

In reference to the final point, GlyphLessFont defines itself as having a 1:1 mapping of Unicode to character ID, and then maps all character IDs to glyph ID 1, which is a blank cell. ActualText is supposed to supply an alternate list of character IDs that are used for searching and copy-pasting, but not for rendering, such as in the example from the PDF reference manual, where hyphenation is present in the rendered text but eliminated in ActualText.

All that said, it's quite possible most PDF viewers don't respect ActualText even when it's properly encoded.
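To make the ActualText idea concrete, here is a minimal sketch of what such a marked-content span could look like in a content stream. It is not OCRmyPDF's code: the helper names `utf16be_hex` and `actual_text_span` are hypothetical, and it assumes the rendered text is written as a hex string of two-byte character IDs that equal the UTF-16BE code units (which only holds for BMP characters under a 1:1 Unicode-to-CID mapping). The ActualText value is written as a UTF-16BE hex string with a leading byte order mark, which sidesteps the pdfdoc encoding question for parenthesized strings.

```python
import codecs


def utf16be_hex(text: str, with_bom: bool = False) -> str:
    """Return text as uppercase hex of its UTF-16BE bytes (hypothetical helper)."""
    data = text.encode("utf-16-be")
    if with_bom:
        # PDF text strings in UTF-16BE start with the byte order mark FE FF.
        data = codecs.BOM_UTF16_BE + data
    return data.hex().upper()


def actual_text_span(rendered: str, actual: str) -> str:
    """Wrap a text-showing operator in a /Span carrying /ActualText.

    rendered: the characters placed in the invisible text layer
              (assumed to map 1:1 from Unicode to two-byte character IDs).
    actual:   the text a reader should use for search and copy-paste.
    """
    return (
        f"/Span <</ActualText <{utf16be_hex(actual, with_bom=True)}>>> BDC "
        f"<{utf16be_hex(rendered)}> Tj EMC"
    )


# Rendered layer holds the NFKC form; ActualText holds the original "1½".
print(actual_text_span("11\u20442", "1\u00bd"))
```

Whether a given PDF viewer honors the span for search or copy-paste is exactly the open question raised above, so output like this would still need testing against real readers.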
Thank you for all the details about the intricate Unicode handling in PDFs! However, I’d like to pop the stack and talk about the bigger picture. When OCRmyPDF encounters the ocrx_word I’d really like to solve this bug in a way that you’d be happy with. Could you please help me understand your proposal?
Describe the proposed feature
HocrTransform.normalize_text normalizes text using the NFKC[1] compatibility algorithm.

OCRmyPDF/src/ocrmypdf/hocrtransform/_hocr.py
Lines 158 to 161 in 6895c2d
As explained in #1272, it does this so that searching for Bauernstube will match Bauernſtube in naïve PDF readers.

Unfortunately, this means that copy-pasting text out of the OCRed PDF will result in the former text, which will not match the rasterized image that the user sees.
If there were an option to choose between NFKC and NFC normalization forms, then the author could opt to render the text more faithfully. In my case, I was surprised that 1½ was normalized to 11/2, which is a very different number!

Footnotes
[1] Unicode® Standard Annex #15: Unicode Normalization Forms
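The difference between the two normalization forms described in this issue can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration, not the code in `_hocr.py`; the `normalize` wrapper and its `form` parameter are hypothetical names.

```python
import unicodedata


def normalize(text: str, form: str = "NFKC") -> str:
    """Normalize text with the given Unicode normalization form (hypothetical wrapper)."""
    return unicodedata.normalize(form, text)


for sample in ["Bauernſtube", "1½"]:
    nfc = normalize(sample, "NFC")
    nfkc = normalize(sample, "NFKC")
    # NFC keeps the long s and the vulgar fraction as scanned;
    # NFKC replaces them with their compatibility equivalents.
    print(f"{sample!r}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC leaves both samples unchanged, while NFKC folds the long s into a plain s and expands ½ into the three characters 1, U+2044 FRACTION SLASH, 2, so the pasted text reads as 11⁄2.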