Uninterpretable Type3 Fonts and Excessive Layout Mode Text Output #3081

Open
shartzog opened this issue Jan 26, 2025 · 0 comments
Labels
workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@shartzog
Contributor

The Type3 font specification in the PDF 1.7 standard allows producers to execute arbitrary glyph-drawing commands on a per-character-code basis. This feature can be used to render non-text PDF content (e.g. charts and graphs) using the standard PDF text operators (Td, Tj, etc.). It also allows producers to render text visually without providing any mechanism for mapping that content back to an encoded character. For example, the drawing commands associated with character code 65 ("A") could draw a "Z", a unicorn, a fire-breathing dragon, or an "A". In such situations, extracting text in layout mode can produce massively inflated output, putting users at risk of OOM exceptions.

Environment

(pdfextnew) C:\Users\samha\pdf-extractor>python -m platform
Windows-10-10.0.22631-SP0
Python 3.10.6 | packaged by conda-forge | (main, Oct 24 2022, 16:02:16) [MSC v.1916 64 bit (AMD64)] on win32

(pdfextnew) C:\Users\samha\pdf-extractor>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0

Code + PDF

>>> from pypdf import PdfReader
>>> r = PdfReader('c:/users/samha/downloads/UninterpretableType3Font.pdf')
>>> layout_output = r.pages[0].extract_text(extraction_mode="layout")
>>> print(len(layout_output))
9947317

UninterpretableType3Font.pdf

Traceback

This issue does not directly raise any exception. However, layout_output in the sample above contains no information of value (only long runs of spaces interspersed with an occasional named reference to a CharProcs entry). Because it occupies nearly 10 MB of memory for a single page, user pipelines that process many pages simultaneously are at risk of OOM exceptions.
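As a stopgap on the caller's side, a pipeline could at least detect and discard this kind of output. The following is a minimal sketch, not part of pypdf: `looks_inflated` and both thresholds are hypothetical, chosen only to match the symptom described above (very long output that is almost entirely whitespace). It does not prevent the large string from being built in the first place.

```python
def looks_inflated(
    text: str,
    max_chars: int = 1_000_000,       # hypothetical size threshold
    min_visible_ratio: float = 0.01,  # hypothetical share of non-whitespace characters
) -> bool:
    """Heuristic: flag very long text that is almost entirely whitespace."""
    if len(text) <= max_chars:
        return False
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / len(text) < min_visible_ratio


# Possible use with the sample above (left as comments, since it needs the PDF):
# layout_output = r.pages[0].extract_text(extraction_mode="layout")
# if looks_inflated(layout_output):
#     layout_output = ""  # drop the page's layout text instead of keeping ~10 MB of spaces
```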

stefan6419846 added the workflow-text-extraction label Jan 27, 2025
stefan6419846 pushed a commit that referenced this issue Jan 27, 2025
Partially addresses #3081 by checking for a '/ToUnicode' map in Type3 font dictionaries. If no such map is present, check whether the font uses standard Adobe glyph names. If it does not, mark the font as 'uninterpretable' and skip collecting text from any text operations that use the font.
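The decision described in that commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the font dictionary is modeled as a plain `dict`, `is_uninterpretable_type3` is a hypothetical name, `KNOWN_GLYPH_NAMES` is a tiny stand-in for the Adobe Glyph List, and treating a font as interpretable only when every name in `/Differences` is known is a guess at what "using standard Adobe glyph names" means.

```python
# Tiny stand-in for the Adobe Glyph List (the real list has thousands of names).
KNOWN_GLYPH_NAMES = {"A", "B", "space", "period", "comma", "one", "two"}


def is_uninterpretable_type3(font: dict, known_names=KNOWN_GLYPH_NAMES) -> bool:
    """Return True if a Type3 font offers no way to map codes back to characters."""
    if font.get("/Subtype") != "/Type3":
        return False  # only Type3 fonts are considered here
    if "/ToUnicode" in font:
        return False  # an explicit code-to-Unicode map is available
    encoding = font.get("/Encoding")
    differences = encoding.get("/Differences", []) if isinstance(encoding, dict) else []
    # /Differences mixes integer start codes with glyph names; keep only the names.
    names = [item.lstrip("/") for item in differences if isinstance(item, str)]
    # Assumption: no names at all, or any unknown name, means uninterpretable.
    return not names or any(name not in known_names for name in names)
```

Under these assumptions, a Type3 font whose `/Differences` array contains names such as `/g1` and no `/ToUnicode` entry is flagged, while one that uses `/A` and `/B` is not.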