Error while reading bookmarks/outlines "TypeError: argument of type 'NoneType' is not iterable" #1059

Closed
hassanseoul123 opened this issue Jul 5, 2022 · 7 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-robustness-issue From a users perspective, this is about robustness PdfReader The PdfReader component is affected

Comments

hassanseoul123 commented Jul 5, 2022

"TypeError: argument of type 'NoneType' is not iterable"
I got this error when I tried to read the outlines of a PDF file with PdfReader.outlines.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.1

Code + PDF

Example PDF file: sample.pdf (Yes, you can use this file for tests)

from PyPDF2 import PdfReader
reader = PdfReader("sample.pdf")
print(reader.outlines)

Traceback

This is the complete Traceback I see:

C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py:1089: PdfReadWarning: Object 86 0 not defined.
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\Hassan\Desktop\main.py", line 3, in <module>
    outlines = reader.outlines
  File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 674, in outlines
    return self._get_outlines()
  File "C:\Users\Hassan\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 694, in _get_outlines
    if "/First" in lines:
TypeError: argument of type 'NoneType' is not iterable
@MartinThoma
Member

Thank you for reporting the issue ❤️

PdfReader.outlines is the one you should use. The others do the same thing, but they are deprecated (see CHANGELOG)

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfReader The PdfReader component is affected Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Jul 5, 2022
@MartinThoma MartinThoma added is-robustness-issue From a users perspective, this is about robustness and removed is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Jul 5, 2022
@MartinThoma
Member

The PDF is not standard-compliant. You can see a warning:

PdfReadWarning: Object 86 0 not defined

Via https://demo.verapdf.org/ you can see several issues. I'm not certain, though, whether they are connected to the problem you face. I think the xref table might just be wrong. I don't know how Atril can recover the outlines from it.

@mtd91429
Contributor

Yes, the problem in this file is the xref objects.

PyPDF2 reads a PDF by first locating and parsing the xref table. It then uses that table, together with other dictionaries in the file, to locate each object by its byte offset.
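To make the xref idea concrete, here is a minimal sketch of parsing a classic cross-reference section into a lookup table. It is not PyPDF2's code: the `XREF` bytes are synthetic (not taken from sample.pdf), `parse_xref` is a hypothetical helper, and cross-reference streams and incremental updates are ignored.

```python
import re

# Synthetic cross-reference section, invented for illustration.
XREF = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000000120 00000 n \n"
    b"trailer\n"
)

# One entry: 10-digit byte offset, 5-digit generation, n (in use) or f (free).
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")


def parse_xref(data):
    """Map (object number, generation) to byte offset for in-use entries."""
    lines = data.splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("not a classic xref section")
    table = {}
    i = 1
    while i < len(lines):
        parts = lines[i].split()
        # A subsection header is "<first object number> <entry count>".
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            break  # reached "trailer" or something unexpected
        first, count = int(parts[0]), int(parts[1])
        for k, raw in enumerate(lines[i + 1 : i + 1 + count]):
            m = ENTRY.match(raw)
            if m and m.group(3) == b"n":
                table[(first + k, int(m.group(2)))] = int(m.group(1))
        i += 1 + count
    return table


xref = parse_xref(XREF)
print(xref)               # offsets of the in-use objects
print(xref.get((86, 0)))  # an object the table does not list
```

An object that the table does not list, such as 86 0 here, simply has no known byte offset, which is the situation the warning in this issue describes.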

For the outline in this example, PyPDF2 processes the trailer, which points to the document /Root. /Root points to the outline dictionary at object 86 (object number) 0 (generation number), and this object is missing. The outline dictionary is supposed to point to the /First and /Last children (outline items) and serves as the starting point for building the outline tree. The outline dictionary does exist in the document, just at a different location (i.e., not at 86 0 R). Some commercial PDF software, such as Adobe Acrobat or PDF-XChange, can fix this. However, from what I can tell, fixing it is currently beyond PyPDF2's "plug-n-play" capabilities. I think it could be done with some one-off code for this specific situation, but it is probably easiest to open and re-save the document in Adobe Acrobat.
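The lookup chain described above can be sketched with plain dictionaries. This is only an illustration, not PyPDF2's implementation: `Ref`, `OBJECTS`, `TRAILER`, `resolve`, and `first_outline_item` are all hypothetical stand-ins, and the assumption is that a reference to an undefined object resolves to `None`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Stand-in for an indirect reference such as `86 0 R`."""
    num: int
    gen: int


# Hypothetical object store; object 86 0 is deliberately absent.
OBJECTS = {
    (1, 0): {"/Type": "/Catalog", "/Outlines": Ref(86, 0)},
}
TRAILER = {"/Root": Ref(1, 0)}


def resolve(value):
    """Follow an indirect reference; undefined objects resolve to None."""
    if isinstance(value, Ref):
        return OBJECTS.get((value.num, value.gen))
    return value


def first_outline_item(trailer):
    """Walk trailer -> /Root -> /Outlines -> /First, tolerating gaps."""
    catalog = resolve(trailer["/Root"])
    if catalog is None or "/Outlines" not in catalog:
        return None
    outlines = resolve(catalog["/Outlines"])
    # Without this None check, `"/First" in outlines` raises the
    # TypeError from the traceback in this issue.
    if outlines is None or "/First" not in outlines:
        return None
    return resolve(outlines["/First"])


print(first_outline_item(TRAILER))
```

With the guard in place the walk stops quietly at the missing object instead of raising, but the outline is still not recovered.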

For the code base, we could consider adding logic to PdfReader._get_outlines(): if the /Catalog contains a reference to the /Outlines dictionary but that reference is missing from the xref table, manually parse the document's objects, try to infer the outline dictionary from the attributes defined in Table 152 of the PDF 1.7 specification, and then update the /Catalog /Outlines pointer. That would probably be best implemented as part of a larger framework for handling misplaced and/or unreferenced objects, rather than as a one-off fix for this niche bug.
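A rough sketch of that recovery idea: scan the raw bytes for `N G obj ... endobj` blocks and keep those that look like an outline dictionary. Everything here is an assumption for illustration, not PyPDF2 code. `RAW` is a synthetic file body, `find_outline_candidates` is a hypothetical helper, and the heuristic (either `/Type /Outlines`, or `/First` and `/Last` without the `/Title` or `/Parent` keys of an outline item) ignores compressed object streams.

```python
import re

# Synthetic file body: the outline dictionary lives at object 90 0,
# although the catalog points to the undefined object 86 0.
RAW = (
    b"1 0 obj\n<< /Type /Catalog /Outlines 86 0 R /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"90 0 obj\n<< /Type /Outlines /First 91 0 R /Last 91 0 R /Count 1 >>\nendobj\n"
    b"91 0 obj\n<< /Title (Chapter 1) /Parent 90 0 R >>\nendobj\n"
)

# Matches "<num> <gen> obj ... endobj" and captures the object body.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def find_outline_candidates(data):
    """Return (number, generation) of objects resembling an outline dictionary."""
    candidates = []
    for m in OBJ.finditer(data):
        body = m.group(3)
        typed = re.search(rb"/Type\s*/Outlines\b", body) is not None
        has_links = b"/First" in body and b"/Last" in body
        # Outline items also carry /First and /Last when they have children,
        # but they additionally have /Title and /Parent.
        is_item = b"/Title" in body or b"/Parent" in body
        if typed or (has_links and not is_item):
            candidates.append((int(m.group(1)), int(m.group(2))))
    return candidates


print(find_outline_candidates(RAW))
```

A single candidate could then be used to repoint /Catalog /Outlines; several or zero candidates would need a fallback, which is one reason a general framework seems preferable to a one-off.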

@MartinThoma
Member

Outlines that Chrome can extract:

image

MartinThoma pushed a commit that referenced this issue Jul 23, 2022
Adjust `PdfReader._build_outline(...)` and `PdfReader._build_destination(...)` to handle outline items with and without valid destinations

Closes #193 : PdfReadError: Unexpected destination '/__WKANCHOR_2'
Closes #956 : ValueError: Unresolved bookmark

#1059 no longer throws an exception, but the outlines are not extracted either.

Closes #1068 : Skip NameObject when building outline
@pubpub-zz
Collaborator

Retested with the latest dev version (2.10.4+ / 5?, in progress).
The same results as in Chrome can be observed.
The objects 86 and 88 can be retrieved successfully.

@MartinThoma, this issue should be closed

@pubpub-zz
Collaborator

+1?

@MartinThoma
Member

Thank you!


4 participants