Extracting text from PDF with embedded Identity-H font fails #330

Closed
gliwka opened this issue Jan 27, 2020 · 6 comments · Fixed by #521


gliwka commented Jan 27, 2020

Summary

Extracting the text from a file with an embedded Identity-H font that was previously created by OpenPDF fails (sample file: identity-h.pdf). It returns an empty string instead of "Hello World".

I'm using the latest release.

Discoveries during debugging

I was able to trace the issue to

char[] chars = pdfText.getOriginalChars();

Getting the original chars on the PDF text in case of Identity-H returns an empty char-array. The implementation returns an empty array if the encoding is not null. Setting the encoding to null doesn't help, because the logic in getOriginalChars doesn't return the proper byte-values.

https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/parser/ParsedText.java#L120:L124

The CID-Map decoding on the font directly seems to work fine. Calling decode on the CMapAwareDocumentFont using the value of the pdf string (graphicsState.getFont().decode(pdfText.value)) returns the proper expected text.

My attempts to fix this failed; I don't understand the logic behind the getOriginalChars method.

@gliwka gliwka changed the title Extracting text from PDF with Identity-H font fails Extracting text from PDF with embedded Identity-H font fails Jan 27, 2020
@andreasrosdal
Contributor

Thanks for reporting. Pull requests welcome.

@asturio
Member

asturio commented Oct 28, 2020

@gliwka,
I'm also having some issues trying to extract Strings containing the "€" symbol. I can't tell whether the problem is in generating the PDF or in reading it.

Have you found a workaround for this issue?

@daviddurand
Member

I wrote some of this code, and Identity-H and character encoding generally is a real pain.

When PDF was created, 8 bits was the size of a character in any font, and Unicode support was added later. This means that there are various odd cases throughout the code to determine what the stored bytes for a PdfString actually mean (sometimes they are 8-bit values that map to Unicode characters via a CMap, sometimes they must be paired to form characters directly, sometimes they represent characters in one of the historical 8-bit encodings).
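To make the three interpretations concrete, here is a minimal sketch in plain Java. It is not OpenPDF code: the class `PdfStringBytesSketch` and its method names are made up for illustration, and ISO-8859-1 merely stands in for "one of the historical 8-bit encodings".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of OpenPDF: three ways to read the same raw bytes.
final class PdfStringBytesSketch {

    // Every byte is one 8-bit character code (0..255), e.g. an index into a CMap.
    static int[] asSingleByteCodes(byte[] raw) {
        int[] codes = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            codes[i] = raw[i] & 0xFF; // mask to avoid sign extension
        }
        return codes;
    }

    // Bytes are paired big-endian into 16-bit codes, as a two-byte encoding
    // such as Identity-H requires.
    static int[] asDoubleByteCodes(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("two-byte codes need an even number of bytes");
        }
        int[] codes = new int[raw.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    // Bytes are characters in a historical 8-bit encoding (ISO-8859-1 as an example).
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

The same four bytes `00 48 00 69` give four codes under the first reading and two codes (0x0048, 0x0069) under the second, which is why the code has to know where a string came from before touching its bytes.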

Basically, the reason getting code points was necessary is that fonts sometimes have variant glyphs (like ligatures, etc.). These may correspond to a Unicode string of multiple characters: fi fl. If you want to preserve character dimensions, that causes trouble. This is discussed briefly in 0c341c8. As I need the text extractor to preserve that information, that was a primary consideration in my head.
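The ligature problem can be shown without any PDF code. The sketch below uses Unicode compatibility normalization (`java.text.Normalizer`, form NFKC) purely as an illustration of one glyph expanding to several characters; it is not how the OpenPDF parser maps glyphs, and the class name `LigatureSketch` is made up.

```java
import java.text.Normalizer;

// Illustration only: one ligature code point expands to several characters.
final class LigatureSketch {

    // Expand compatibility characters such as U+FB01 (fi ligature) to plain letters.
    static String expand(String glyphText) {
        return Normalizer.normalize(glyphText, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String oneGlyph = "\uFB01"; // a single code point drawn as one glyph
        String expanded = expand(oneGlyph);
        // One glyph with one width on the page, but two characters of text.
        System.out.println(oneGlyph.length() + " glyph code -> " + expanded.length() + " characters");
    }
}
```

Once the text is expanded, character positions no longer line up one-to-one with glyph widths, which is the trouble with preserving character dimensions.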

Looking at the code for PdfString.getOriginalChars(), I see that if there is an encoding in place, line 254 returns an empty char[]. I think instead returning toUnicodeString().toCharArray() would make that method work for more files.
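A rough sketch of that suggestion, written against a stand-in class rather than the real PdfString: `OriginalCharsSketch`, its fields, and its constructor are hypothetical, and `unicodeText` simply plays the role of whatever toUnicodeString() would return. Whether the real method behaves well with this change for a given file would still need to be tested.

```java
// Hypothetical stand-in for the relevant parts of a PDF string object.
final class OriginalCharsSketch {
    private final byte[] originalBytes; // raw bytes as stored in the file
    private final String encoding;      // null or empty means "no explicit encoding"
    private final String unicodeText;   // stand-in for the result of toUnicodeString()

    OriginalCharsSketch(byte[] originalBytes, String encoding, String unicodeText) {
        this.originalBytes = originalBytes;
        this.encoding = encoding;
        this.unicodeText = unicodeText;
    }

    char[] originalChars() {
        if (encoding == null || encoding.isEmpty()) {
            // No encoding: widen each byte to a char, keeping the 8-bit value.
            char[] chars = new char[originalBytes.length];
            for (int i = 0; i < originalBytes.length; i++) {
                chars[i] = (char) (originalBytes[i] & 0xFF);
            }
            return chars;
        }
        // Suggested change: fall back to the decoded text instead of an empty array.
        return unicodeText.toCharArray();
    }
}
```

With an encoding set, the caller now gets the decoded characters instead of nothing; the no-encoding branch is unchanged.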

Just removing the check for missing or null encoding probably does not make sense (unless you can figure out how to interpret the bytes; I failed at that). Passing back the wrong results creates IndexOutOfBoundsExceptions galore.

There's a huge variation in text encoding and Unicode usage in PDF files, and so this is a bit of a minefield. It's worse because the parser implementation frequently uses Java Strings to hold arrays of 8-bit or 16-bit values that are not in any standard encoding. So it's not obvious what a given String represents in the code unless you know where it came from and where it's going. Fixing that mess in PdfString and all the classes that call it was beyond my time budget.

Given that the current else condition is obviously not sensible, if the change works for your file, it would be a good update.

I don't know when I'll be able to try rebuilding this, and I don't have your test file, so I'd recommend trying it if you can.

@andreasrosdal
Contributor

Any news on this? 😊

@asturio
Member

asturio commented Feb 4, 2021

Still not solved. I don't have that much time :-(, but I still want to fix it. Any help?

@Wugengxian

Wugengxian commented Apr 23, 2021

@asturio @andreasrosdal I have fixed it. The idea is to use the /ToUnicode information in the font. Link to solution
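For readers unfamiliar with /ToUnicode, the general idea can be sketched as follows. This is not the code from the linked fix: `ToUnicodeSketch` is a made-up class, the map is built by hand instead of being parsed from a font's /ToUnicode CMap stream, and the codes used in the usage note are invented.

```java
import java.util.Map;

// Hypothetical sketch: decode two-byte character codes through a code-to-Unicode table.
final class ToUnicodeSketch {

    // Reads big-endian 16-bit codes and looks each one up in the table.
    // Codes without an entry become U+FFFD (the replacement character).
    static String decode(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }
}
```

With a table containing 0x002B mapped to "H" and 0x004C mapped to "i", the bytes `00 2B 00 4C` decode to "Hi". Because the table values are strings, a single code can also map to several characters, which covers the ligature case.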
