Uninterpretable Type3 Fonts and Excessive Layout Mode Text Output #3081

shartzog · 2025-01-26T22:45:23Z

The Type3 font specification in the PDF 1.7 standard allows producers to execute arbitrary glyph drawing commands for each character code. This feature can be used to render non-text PDF content (e.g. charts and graphs) using the standard PDF text operators (Td, Tj, etc.). It also allows producers to render text content visually without providing any mechanism for translating that content back to a true encoded character. For example, the drawing commands associated with character code 65 ("A") could be used to draw a "Z" or a unicorn or a fire-breathing dragon or an "A". In such situations, extracting text in layout mode can result in massively inflated outputs, putting users at risk of OOM exceptions.

Environment

(pdfextnew) C:\Users\samha\pdf-extractor>python -m platform
Windows-10-10.0.22631-SP0
Python 3.10.6 | packaged by conda-forge | (main, Oct 24 2022, 16:02:16) [MSC v.1916 64 bit (AMD64)] on win32

(pdfextnew) C:\Users\samha\pdf-extractor>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0

Code + PDF

>>> from pypdf import PdfReader
>>> r = PdfReader('c:/users/samha/downloads/UninterpretableType3Font.pdf')
>>> layout_output = r.pages[0].extract_text(extraction_mode="layout")
>>> print(len(layout_output))
9947317

UninterpretableType3Font.pdf

Traceback

This issue does not directly result in any exception. However, the contents of layout_output in the sample above will contain no information of value (only long strings of spaces interspersed with an occasional named reference to a CharProcs entry), and due to its consumption of nearly 10MB of memory, user pipelines that process many pages simultaneously are put at risk of OOM exceptions.
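Until a fix is available, a caller can screen layout output before keeping it in memory. The sketch below is only a heuristic based on the symptom described above (very long output that is almost entirely whitespace). The helper name `looks_inflated` and both thresholds are hypothetical and are not part of pypdf.

```python
def looks_inflated(text: str, max_chars: int = 1_000_000,
                   min_density: float = 0.01) -> bool:
    """Heuristic: flag output that is both very long and almost all whitespace.

    Both thresholds are illustrative assumptions, not values taken from pypdf.
    """
    if len(text) <= max_chars:
        return False  # short output is never flagged, whatever it contains
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / len(text) < min_density
```

A pipeline could call `looks_inflated(layout_output)` right after `extract_text(extraction_mode="layout")` and discard or log the page instead of retaining the string.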

Partially addresses #3081 by checking for a '/ToUnicode' map in Type3 font dictionaries. If no such map is present, check to see if the font is using standard Adobe glyph names. If not, mark the font as 'uninterpretable' and prevent collection of text content from any text operations associated with the font.
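The decision described above can be sketched on plain dictionaries shaped like PDF font dictionaries. This is not the code from the pull request: the function name `is_uninterpretable_type3`, the tiny `SAMPLE_GLYPH_NAMES` set (a stand-in for the full Adobe Glyph List), and the rule that every name in `/Differences` must be recognised are all assumptions made for illustration.

```python
from typing import Any, Mapping, Set

# Tiny stand-in for the Adobe Glyph List; a real check would use the full list.
SAMPLE_GLYPH_NAMES: Set[str] = {"/A", "/B", "/space", "/one", "/period"}


def is_uninterpretable_type3(font: Mapping[str, Any],
                             known_names: Set[str] = SAMPLE_GLYPH_NAMES) -> bool:
    """Return True when a Type3 font offers no way to map codes to characters."""
    if font.get("/Subtype") != "/Type3":
        return False  # only Type3 fonts are examined here
    if "/ToUnicode" in font:
        return False  # an explicit character map exists
    encoding = font.get("/Encoding")
    if isinstance(encoding, str):
        return False  # assumption: a named base encoding is interpretable
    differences = encoding.get("/Differences", []) if isinstance(encoding, Mapping) else []
    # /Differences mixes integer start codes with glyph names; keep the names.
    names = [item for item in differences if isinstance(item, str)]
    # Assumption: interpretable only if glyphs are named and every name is known.
    return not names or any(name not in known_names for name in names)
```

For example, a font whose `/Differences` array is `[0, "/g0", "/g1"]` and which has no `/ToUnicode` entry would be flagged, while one using `[65, "/A", "/B"]` would not.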

shartzog mentioned this issue Jan 26, 2025

ROB: Prevent excessive layout mode text output from Type3 fonts #3082

Merged

stefan6419846 added the workflow-text-extraction label (from a user's perspective, text extraction is the affected feature/workflow) Jan 27, 2025
