Extracting text from PDF with embedded Identity-H font fails #330
Comments
Thanks for reporting. Pull requests welcome. |
@gliwka, have you found a workaround for this issue? |
I wrote some of this code, and Identity-H and character encoding generally are a real pain. When PDF was created, 8 bits was the size of a character in any font, and Unicode support was added later. This means that there are various odd cases throughout the code to determine what the stored bytes for a Basically, the reason getting code points was necessary is that fonts sometimes have variant glyphs (like ligatures Looking at the code for Just removing the check for missing or null encoding probably does not make sense (unless you can figure out how to interpret the bytes; I failed at that). Passing back the wrong results creates There's a huge variation in text encoding and Unicode usage in PDF files, so this is a bit of a minefield. It's worse because the parser implementation frequently uses Java Given that the current else condition is obviously not sensible, if the change works for your file, it would be a good update. I don't know when I'll be able to try rebuilding this, and I don't have your test file, so I'd recommend trying it if you can. |
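To illustrate the byte-interpretation problem described in the comment above, here is a minimal sketch in plain Java. It does not use OpenPDF; the class name `TwoByteCodes` and the helper `toTwoByteCodes` are hypothetical, and the sample bytes are made up. It assumes the usual Identity-H convention that each character code is two bytes, high byte first, and shows how reading the same bytes one byte per character gives a different result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenPDF.
class TwoByteCodes {

    // Splits raw PDF string bytes into 2-byte big-endian character codes.
    static int[] toTwoByteCodes(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] codes = new int[raw.length / 2];
        for (int i = 0; i < codes.length; i++) {
            // Mask with 0xFF because Java bytes are signed.
            codes[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    public static void main(String[] args) {
        // Made-up sample: two 2-byte codes, 0x002B and 0x0048.
        byte[] raw = {0x00, 0x2B, 0x00, 0x48};

        // One byte per character: four chars, two of them NUL.
        String oneBytePerChar = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(oneBytePerChar.length()); // prints 4

        // Two bytes per code: two codes.
        System.out.println(Arrays.toString(toTwoByteCodes(raw))); // prints [43, 72]
    }
}
```

Note that with Identity-H the resulting codes are glyph identifiers for the embedded font, not Unicode code points, so a further mapping step is still needed to get text.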
Any news on this? 😊 |
Still not solved. I don't have that much time :-(, but I still want to fix it. Any help? |
@asturio @andreasrosdal I have fixed it. The idea is to use the /ToUnicode information in the font. Link to solution |
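The /ToUnicode idea can be sketched roughly as follows. This is not the linked fix and does not use OpenPDF; it is a plain-Java illustration under stated assumptions. The class `ToUnicodeSketch`, the method `decode`, and the code-to-text table are all hypothetical, standing in for whatever a parsed /ToUnicode CMap would provide. A code may map to more than one character (for example a ligature glyph), which is why the table values are strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration only; not part of OpenPDF.
class ToUnicodeSketch {

    // Decodes 2-byte big-endian codes through a code-to-Unicode table.
    // Codes missing from the table become U+FFFD instead of being dropped.
    // A trailing odd byte, if any, is ignored.
    static String decode(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up table; a real one would come from the font's /ToUnicode CMap.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0001, "H");
        toUnicode.put(0x0002, "i");
        toUnicode.put(0x0003, "fi"); // one ligature glyph, two characters

        byte[] raw = {0x00, 0x01, 0x00, 0x02, 0x00, 0x03};
        System.out.println(decode(raw, toUnicode)); // prints Hifi
    }
}
```

Emitting a replacement character for unmapped codes is a design choice made here so that a missing mapping stays visible in the output rather than yielding an empty string.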
Summary
Extracting the text from a file with an embedded Identity-H font, previously created by OpenPDF, fails (sample file: identity-h.pdf). It returns an empty string instead of "Hello World".
I'm using the latest release.
Discoveries during debugging
I was able to trace the issue to
OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/parser/ParsedText.java
Line 283 in bf9f968
For Identity-H, getting the original chars of the PDF text returns an empty char array. The implementation returns an empty array if the encoding is not null. Setting the encoding to null doesn't help, because the logic in getOriginalChars doesn't return the proper byte values.
https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/parser/ParsedText.java#L120:L124
The CID-Map decoding on the font directly seems to work fine. Calling decode on the CMapAwareDocumentFont using the value of the pdf string (graphicsState.getFont().decode(pdfText.value)) returns the proper expected text.
My attempts to fix this failed; I don't understand the logic behind the getOriginalChars method.