Extracting text from PDF with embedded Identity-H font fails #330

gliwka · 2020-01-27T17:48:10Z

Summary

Extracting the text from a file with an embedded Identity-H font, previously created by OpenPDF, fails (sample file: identity-h.pdf). The extractor returns an empty string instead of "Hello World".

I'm using the latest release.

Discoveries during debugging

I was able to trace the issue to

OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/parser/ParsedText.java

Line 283 in bf9f968

char[] chars = pdfText.getOriginalChars();

For Identity-H, calling getOriginalChars on the PDF text returns an empty char array, because the implementation returns an empty array whenever the encoding is not null. Setting the encoding to null doesn't help, because the logic in getOriginalChars then doesn't return the proper byte values.

https://github.com/LibrePDF/OpenPDF/blob/master/openpdf/src/main/java/com/lowagie/text/pdf/parser/ParsedText.java#L120:L124

The CID-Map decoding on the font directly seems to work fine. Calling decode on the CMapAwareDocumentFont using the value of the pdf string (graphicsState.getFont().decode(pdfText.value)) returns the proper expected text.

My attempts to fix this failed; I don't understand the logic behind the getOriginalChars method.

andreasrosdal · 2020-01-28T06:19:08Z

Thanks for reporting. Pull requests welcome.

asturio · 2020-10-28T18:15:21Z

@gliwka,
I'm also having some issues trying to extract Strings containing the "€" symbol. I can't tell whether the problem is in generating the PDF or in reading it.

Have you found a workaround for this issue?

daviddurand · 2020-10-28T22:03:36Z

I wrote some of this code, and Identity-H and character encoding in general are a real pain.

When PDF was created, 8 bits was the size of a character in any font, and Unicode support was added later. This means that there are various odd cases throughout the code to determine what the stored bytes for a PdfString actually mean (sometimes they are 8-bit values that map to Unicode characters via a CMap, sometimes they must be paired to form characters directly, sometimes they represent characters in one of the historical 8-bit encodings).
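To make the difference between those readings concrete, here is a minimal sketch that is not OpenPDF code. It assumes a made-up lookup table (`toUnicode`, with hypothetical glyph codes 1 and 2) standing in for a font's real mapping, and compares reading the stored bytes one char per byte against pairing them into 16-bit codes first.

```java
import java.util.HashMap;
import java.util.Map;

public class IdentityHDecodeSketch {

    /** Naive reading: one char per stored byte (only right for 8-bit encodings). */
    static char[] oneCharPerByte(byte[] raw) {
        char[] chars = new char[raw.length];
        for (int i = 0; i < raw.length; i++) {
            chars[i] = (char) (raw[i] & 0xff);
        }
        return chars;
    }

    /** Two-byte reading: pair bytes big-endian into codes, then map each code to text. */
    static String decodeTwoByteCodes(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xff) << 8) | (raw[i + 1] & 0xff);
            // Unmapped codes become the replacement character so the gap stays visible.
            out.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical glyph codes; a real font defines its own table.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0001, "H");
        toUnicode.put(0x0002, "i");

        byte[] raw = {0x00, 0x01, 0x00, 0x02};
        System.out.println(oneCharPerByte(raw).length);         // four chars, none of them text
        System.out.println(decodeTwoByteCodes(raw, toUnicode)); // the two mapped characters
    }
}
```

The same four bytes yield four meaningless chars under the one-byte reading and two characters under the paired reading, which is why the code has to know which case it is in.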

Basically, getting code points was necessary because fonts sometimes have variant glyphs (like the ligatures ﬁ, ﬂ, etc.). These may correspond to a Unicode string of multiple characters: fi, fl. If you want to preserve character dimensions, that causes trouble. This is discussed briefly in 0c341c8. As I need the text extractor to preserve that information, that was a primary consideration in my head.
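The length mismatch can be shown with plain Java. This sketch uses `java.text.Normalizer` with NFKC as a stand-in for the expansion a font mapping might perform; that substitution is an assumption for illustration and is not how OpenPDF resolves ligatures.

```java
import java.text.Normalizer;

public class LigatureLengthSketch {

    /** Expands compatibility characters such as the "fi" ligature (U+FB01) into plain letters. */
    static String expand(String glyphText) {
        return Normalizer.normalize(glyphText, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String glyphs = "\uFB01le";       // three glyphs: the fi ligature, l, e
        String expanded = expand(glyphs); // four characters: f, i, l, e

        // One glyph width no longer lines up with one character of extracted text.
        System.out.println(glyphs.length() + " glyphs, " + expanded.length() + " characters");
    }
}
```

Any code that indexes per-glyph widths by character position in the expanded string would run past the end of the width array, which is the kind of trouble described above.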

Looking at the code for PdfString.getOriginalChars(), I see that if there is an encoding in place, line 254 returns an empty char[]. I think instead returning toUnicodeString().toCharArray() would make that method work for more files.
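As a rough sketch of that suggestion, the class below is a simplified stand-in, not the real PdfString. Its fields, constructor, and the two method names `currentBehavior` and `suggestedBehavior` are hypothetical, and `toUnicodeString()` here just returns a pre-decoded value supplied by the caller.

```java
public class OriginalCharsSketch {
    private final byte[] originalBytes;
    private final String encoding;    // null or empty means no declared encoding
    private final String unicodeText; // hypothetical: the already decoded text

    public OriginalCharsSketch(byte[] originalBytes, String encoding, String unicodeText) {
        this.originalBytes = originalBytes;
        this.encoding = encoding;
        this.unicodeText = unicodeText;
    }

    String toUnicodeString() {
        return unicodeText;
    }

    private char[] bytesAsChars() {
        char[] chars = new char[originalBytes.length];
        for (int i = 0; i < originalBytes.length; i++) {
            chars[i] = (char) (originalBytes[i] & 0xff);
        }
        return chars;
    }

    /** Behavior as described in this thread: empty result whenever an encoding is set. */
    char[] currentBehavior() {
        if (encoding == null || encoding.isEmpty()) {
            return bytesAsChars();
        }
        return new char[0];
    }

    /** Suggested change: fall back to the decoded text instead of an empty array. */
    char[] suggestedBehavior() {
        if (encoding == null || encoding.isEmpty()) {
            return bytesAsChars();
        }
        return toUnicodeString().toCharArray();
    }

    public static void main(String[] args) {
        OriginalCharsSketch s =
                new OriginalCharsSketch(new byte[] {0, 1}, "Identity-H", "Hello World");
        System.out.println(s.currentBehavior().length);
        System.out.println(new String(s.suggestedBehavior()));
    }
}
```

Whether the decoded text is actually correct for a given file still depends on the font's mapping, so this only changes what happens in the branch that used to return nothing.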

Just removing the check for a missing or null encoding probably does not make sense (unless you can figure out how to interpret the bytes; I failed at that). Passing back the wrong results creates IndexOutOfBoundsExceptions galore.

There's a huge variation in text encoding and Unicode usage in PDF files, and so this is a bit of a minefield. It's worse because the parser implementation frequently uses Java Strings to hold arrays of 8-bit or 16-bit values that are not in any standard encoding. So it's not obvious what a given String represents in the code unless you know where it came from and where it's going. Fixing that mess in PdfString and all the classes that call it was beyond my time budget.
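A small illustration of why such Strings are ambiguous, using only the standard library (nothing here is taken from the OpenPDF sources): the same two bytes are wrapped one char per byte, as a parser might hold them, and then decoded as UTF-16BE.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringAsByteContainerSketch {

    /** Wraps raw bytes in a String, one char per byte, without interpreting them. */
    static String wrap(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the raw bytes from a String produced by wrap(). */
    static byte[] unwrap(String container) {
        return container.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {0x20, (byte) 0xAC}; // UTF-16BE bytes of the euro sign, U+20AC

        String container = wrap(raw);                             // two chars: U+0020, U+00AC
        String decoded = new String(raw, StandardCharsets.UTF_16BE); // one char: the euro sign

        System.out.println(container.length() + " vs " + decoded.length());
        System.out.println(Arrays.equals(raw, unwrap(container))); // the bytes survive the round trip
    }
}
```

Both values are ordinary Java Strings, and nothing in the type says which one is text and which one is a byte container. That is the information a caller has to carry separately.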

Given that the current else condition is obviously not sensible, if the change works for your file, it would be a good update.

I don't know when I'll be able to try rebuilding this, and I don't have your test file, so I'd recommend trying it if you can.

andreasrosdal · 2020-11-13T23:12:24Z

Any news on this? 😊

asturio · 2021-02-04T19:25:53Z

Still not solved. I don't have much time :-(, but I still want to fix it. Any help?

Wugengxian · 2021-04-23T00:21:41Z

@asturio @andreasrosdal I have fixed it. The fix uses the /ToUnicode information in the font. Link to solution

gliwka changed the title ~~Extracting text from PDF with Identity-H font fails~~ Extracting text from PDF with embedded Identity-H font fails Jan 27, 2020

andreasrosdal added bug help wanted enhancement and removed bug labels Jan 28, 2020

LibrePDF deleted a comment from mkl-public Apr 23, 2020

andreasrosdal added the bug label Apr 23, 2020

asturio self-assigned this Oct 28, 2020

jtjeferreira mentioned this issue Nov 5, 2020

Update to use OpenPDF 1.0.5 and remove dependency on patched iText 2.1.7.js6 version TIBCOSoftware/jasperreports#17

Closed

asturio pinned this issue Dec 4, 2020

mkl-public mentioned this issue Dec 16, 2020

Missing text from pdf file when calling PDFReader getPageContent #463

Closed

asturio mentioned this issue Apr 25, 2021

Bug fix 330: Text extract #521

Merged

asturio linked a pull request Apr 25, 2021 that will close this issue

Bug fix 330: Text extract #521

Merged

asturio closed this as completed in #521 Apr 25, 2021

asturio unpinned this issue May 2, 2021

asturio added this to the 1.3.26 milestone May 2, 2021

dominioon mentioned this issue Jan 31, 2023

Openpdf flyingsaucerproject/flyingsaucer#174

Closed

nishantkumar21stjul mentioned this issue Apr 25, 2024

Extracting text from PDFs with encodings- Identity-H, Roman fails, gives a blank response. jsvine/pdfplumber#1132

Closed
