Consider supporting ActualText #41

badicsalex · 2022-09-19T19:45:41Z

I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.

For example I have the following: https://stickman.hu/junk/actualtext_example.pdf

Here, the line

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 EU rendeletek” szövegrész helyébe

Extracts as

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 eU rendeletek” szövegrész helyébe
                                                                             ^
                                                                             |
                                                                         Lowercase

Note that several PDF viewers (e.g. the Firefox built-in one) also copy the wrong text. Chrome, Okular, and poppler-based tools in general capitalize the E in EU correctly. pdftotext from the poppler suite also works OK.

Now why is this? For some reason, the CIDs for both E and e are mapped to the ASCII code point 101 (lowercase e) by the font's ToUnicode mapping.

Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:

op: Operation { operator: "BDC", operands: [/Span, <</ActualText (��^@E)>>] }
op: Operation { operator: "Td", operands: [30.888, 0] }
op: Operation { operator: "Tj", operands: [(E)] }
op: Operation { operator: "EMC", operands: [] }

ActualText is described in section 14.9.4 "Replacement Text" of the PDF standard, and has a special code path in poppler: https://github.com/freedesktop/poppler/blob/315ab3006fb24bf47b595343e6a3e90995f2a588/poppler/Gfx.cc#L5052-L5059
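
The ActualText operand in the dump above is a PDF text string. The unprintable bytes in front of the E are consistent with a UTF-16BE byte order mark (`FE FF`) followed by `00 45`, though that is an inference from the garbled output, not something verified against the file. A minimal sketch of decoding such a value, assuming a hypothetical helper name `decode_text_string` and treating every non-UTF-16 string as Latin-1 (which only approximates PDFDocEncoding):

```rust
/// Decode the raw bytes of a PDF text string such as an /ActualText value.
/// Hypothetical helper, not part of pdf-extract or lopdf.
fn decode_text_string(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // UTF-16BE with a byte order mark: pair up the remaining bytes.
        // A trailing odd byte, if any, is silently dropped by chunks_exact.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        // Assumption: treat everything else as Latin-1. Real PDFDocEncoding
        // differs from Latin-1 in some code ranges.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```

With the bytes inferred above, `decode_text_string(&[0xFE, 0xFF, 0x00, 0x45])` yields the string `E`.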

As far as I can see, handling this case would need some refactoring around show_text, and I'm really not sure how to do it. Probably it would need fully separate code paths for the "simple" and the replacement text use-cases, both of which would call output_character in the end.
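
One possible shape for the two code paths, as a sketch only: the `Op` enum and `extract` function below are hypothetical simplifications, not lopdf's `Operation` type or pdf-extract's actual `show_text` machinery. The idea is to track open marked-content sequences on a stack, emit the ActualText once when such a sequence opens, and suppress the glyph-derived text until the matching EMC.

```rust
/// Simplified stand-in for a content-stream operation (hypothetical type).
enum Op {
    /// `BDC`, carrying the already-decoded /ActualText of its property list, if any.
    BeginMarked { actual_text: Option<String> },
    /// `Tj`, carrying the text as decoded through the font's ToUnicode map.
    ShowText(String),
    /// `EMC`.
    EndMarked,
}

fn extract(ops: &[Op]) -> String {
    let mut out = String::new();
    // One entry per open marked-content sequence: true if it carried ActualText.
    let mut stack: Vec<bool> = Vec::new();
    // Number of currently open sequences that carry ActualText.
    let mut replacing = 0usize;

    for op in ops {
        match op {
            Op::BeginMarked { actual_text } => match actual_text {
                Some(text) => {
                    // Replacement path: only the outermost ActualText is emitted.
                    if replacing == 0 {
                        out.push_str(text);
                    }
                    replacing += 1;
                    stack.push(true);
                }
                None => stack.push(false),
            },
            Op::ShowText(text) => {
                // Simple path: glyph text is used only outside replaced spans.
                if replacing == 0 {
                    out.push_str(text);
                }
            }
            Op::EndMarked => {
                // An unbalanced EMC pops nothing and is ignored.
                if let Some(true) = stack.pop() {
                    replacing -= 1;
                }
            }
        }
    }
    out
}
```

Emitting the replacement at BDC keeps the sketch short. A real implementation would probably need to defer it until the first glyph of the span is positioned, so that the replacement text gets sensible coordinates.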

P.S. 1: It seems like this guy had a related issue back in the day: https://stackoverflow.com/questions/17737776/pdf-text-extraction-issue-font-capitalization-inconsistencies

P.S. 2: In the end, I might just expose the CID on the output_character interface and do the same workaround I did in python: https://github.com/badicsalex/hun_law_py/blob/master/hun_law/extractors/pdf.py#L88-L93
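
A minimal sketch of what such a CID-based workaround could look like, assuming the output interface passes both the character code and the ToUnicode result. The function name `fix_casing` and its exact rule are assumptions for illustration and do not reproduce the linked Python code.

```rust
/// Hypothetical workaround: prefer the character code over the ToUnicode
/// result when the code is an ASCII uppercase letter and the mapping
/// produced exactly its lowercase form.
fn fix_casing(code: u32, mapped: &str) -> String {
    if let Some(upper) = char::from_u32(code).filter(|c| c.is_ascii_uppercase()) {
        let mut chars = mapped.chars();
        // Only override a single-character result that differs purely in case.
        if chars.next() == Some(upper.to_ascii_lowercase()) && chars.next().is_none() {
            return upper.to_string();
        }
    }
    mapped.to_string()
}
```

For the example above, the Tj operand is code 69 (`E`) and the mapping yields `e`, so `fix_casing(69, "e")` returns `E`. Any other mapping is passed through unchanged.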

P.S. 3: Thanks for taking the time to fix some of the bugs I reported, I really appreciate it.

badicsalex added a commit to badicsalex/hun_law_rs that referenced this issue Sep 19, 2022

parser/pdf: workaround for the weird ToUnicode casing bug

f0f66c1

See the comment and also jrmuizel/pdf-extract#41

badicsalex added a commit to badicsalex/hun_law_rs that referenced this issue Sep 19, 2022

parser/pdf: workaround for the weird ToUnicode casing bug

fb64011

See the comment and also jrmuizel/pdf-extract#41

badicsalex added a commit to badicsalex/pdf-extract-fhl that referenced this issue Sep 19, 2022

Pass CID to OutputDev.show_text implementers

24043ab

See jrmuizel#41
