ROB: Handle outlines without destination #1076

mtd91429 · 2022-07-08T18:40:07Z

Adjust PdfReader._build_outline(...) and PdfReader._build_destination(...) to handle outline items with and without valid destinations

Closes #193 : PdfReadError: Unexpected destination '/__WKANCHOR_2'
Closes #956 : ValueError: Unresolved bookmark

#1059 no longer throws an exception, but the outlines are not extracted either.

Closes #1068 : Skip NameObject when building outline

mtd91429 · 2022-07-08T19:59:30Z

Failed on CI due to bookmark/outline in this. The outlines for "Tables" and "Figures" have no destination (i.e., /A is ). Instead, they simply serve as a placeholder to help organize the children headings. Although it is an odd type of bookmark/outline, I personally don't think they should fail to return these bookmarks as it produces unexpected results. Typical PDF reading software show these outlines. Will work on a fix for the issue.

PyPDF2/generic.py

codecov · 2022-07-10T23:35:26Z

Codecov Report

Merging #1076 (f0abf9d) into main (f233c1a) will decrease coverage by 0.01%.
The diff coverage is 84.37%.

@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   92.13%   92.12%   -0.02%     
==========================================
  Files          24       24              
  Lines        4756     4772      +16     
  Branches      985      989       +4     
==========================================
+ Hits         4382     4396      +14     
- Misses        226      228       +2     
  Partials      148      148

Impacted Files	Coverage Δ
PyPDF2/_merger.py	`95.00% <ø> (+2.46%)`	⬆️
PyPDF2/_reader.py	`90.18% <84.37%> (-1.01%)`	⬇️
PyPDF2/_writer.py	`89.32% <0.00%> (+0.17%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f233c1a...f0abf9d. Read the comment docs.

mtd91429 · 2022-07-10T23:39:38Z

Cannot be sure without the file, but believe this commit will address issue noted in #956,

xilopaint · 2022-07-12T05:54:46Z

Cannot be sure without the file, but believe this commit will address issue noted in #956

I can confirm this PR fixes #956. Please @MartinThoma, add this PR to the next release.

PyPDF2/_merger.py

PyPDF2/_reader.py

PyPDF2/generic.py

MartinThoma · 2022-07-13T16:46:43Z

@mtd91429 This PR now contains multiple different things. The color and font format attributes (with the OutlineItem enum) seem pretty straight forward. I think we can merge them very soon. The other parts need more attention as they could break things.

What do you think about making a second PR with the simple changes that we can merge directly?

mtd91429 · 2022-07-13T21:29:43Z

What do you think about making a second PR with the simple changes that we can merge directly?

Seems like a reasonable idea. I've created the changes for an outline's color and font format in a new PR #1104. I'll ultimately remove those congruent changes from this PR branch and focus this PR on handling outlines without destinations

MartinThoma · 2022-07-15T09:48:41Z

Currently, the test_unexpected_destination() does no longer raise an excepting. I need to think about this, but my first impulse is that this might actually be the desired behavior. Hence we might need to adjust the test.

What do you think?

mtd91429 · 2022-07-15T17:56:33Z

Currently, the test_unexpected_destination() does no longer raise an excepting. I need to think about this, but my first impulse is that this might actually be the desired behavior. Hence we might need to adjust the test.

What do you think?

I agree. In the latest commit, I have deleted this test.

mtd91429 · 2022-07-15T19:13:15Z

I've overhauled the logic of how the PdfReader module handles malformed outlines. Let me try to summarize what I've attempted to accomplish with the code and my reasoning; the implementation review is still necessary.

The code base is designed such that outline objects are a subtype of a destination object. While that is the intuitive purpose of an outline object, it is not required by the PDF specification. By this design, outlines lacking a valid destination were ignored. As an example, open the PDF document tika-924546.pdf in Adobe Acrobat (or another commercial PDF viewer) and look at the outline. You'll notice that there are entries entitled "Tables" and "Figures". With the current repository (lacking this comit's changes), run the following code and examine the output.

def show_tree(outlines, indent=0):
    for item in outlines:
        if isinstance(item, list):
            show_tree(item, indent+4)
        else:
            print(f'{" "*indent}{item.title}')

show_tree(PdfReader("tika-924546.pdf").outlines)


>>>...
>>>    Revenue Projections
>>>    Spending Projections
>>>Glossary
>>>    Summary 1. CBO's Baseline Budget Outlook
>>>    Summary 2. CBO’s Economic Projections for Calendar Years 2006 to 2016
>>>    1-1. Projected Deficits and Surpluses in CBO’s Baseline
>>>...
>>>    F-12. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005
>>>    F-13. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005 
>>>    Summary 1. Spending on Social Security, Medicare, and Medicaid, 1990 to 2016
>>>    Summary 2. Total Revenues and Outlays as a Percentage of Gross Domestic Product, 1965 to 2016
>>>    1-1. The Total Deficit or Surplus as a Percentage of GDP, 1965 to 2016 
>>>...

... indicated truncated lines removed for demonstration purposes.

With the current commit's changes, running the above show_tree() snippet produces the following output:

>>>...
>>>       Revenue Projections
>>>    Spending Projections
>>>Glossary
>>>Tables
>>>    Summary 1. CBO's Baseline Budget Outlook
>>>    Summary 2. CBO’s Economic Projections for Calendar Years 2006 to 2016
>>>    1-1. Projected Deficits and Surpluses in CBO’s Baseline
>>>...
>>>    F-12. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005
>>>    F-13. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005 
>>>Figures
>>>    Summary 1. Spending on Social Security, Medicare, and Medicaid, 1990 to 2016
>>>    Summary 2. Total Revenues and Outlays as a Percentage of Gross Domestic Product, 1965 to 2016
>>>    1-1. The Total Deficit or Surplus as a Percentage of GDP, 1965 to 2016 
>>>...

I found this specific instance via CI when trying to implement the color and font format code. I subsequently went down a rabbit hole and found "malformed outlines" in a few more PDFs and felt that PyPDF2's behavior was not optimal.

In the method _build_outline(), the only "required" feature of an outline that is added at this step is the "Title". Although the "/Title" attribute is required by PDF specification 1.7, commercial viewers handle outlines with missing title's without causing an error. For example, open the new PDF located in the resource folder of this commit entitle "outline-without-title.pdf" in Adobe Acrobat. For that reason, I implemented an optional error message to be raised if in "strict" mode. For this, created the unit test test_outline_missing_title().

After that, the method attempts to locate the destination associated with the outline object. If a destination is found, the outline is returned with a destination. If a destination is not found, the outline returns with a Null destination. This is the behavior that @MartinThoma is discussing in that last comment; I don't think it should silently ignore the outline object or throw an error.

I found a few examples of "real world" PDFs in the repository's code base for which the unit tests test_outline_with_missing_named_destination() and test_outline_with_empty_action() were written to handle. Then, I tried to be as comprehensive as possible for all malformed destinations and created the document "outlines-with-invalid-destinations.pdf" from the sample file "pdflatex-outline.pdf". To create that document, I first decompressed the original file (so that I could manipulate the objects in a text editor). I then altered all of the outline objects individually to produce valid or invalid destinations. My manipulations broke the xref table, so I had to re-save the document (which re-compressed it). Unfortunately, I had to use non-PyPDF2 software to accomplish these tasks (ultimately, I'd like to see if I can implement them with the current repository or, if not, expand the code base and/or documentation in a way to accomplish the same goals). The unit test test_outlines_with_invalid_destinations() uses this document and returns all outlines (with valid and invalid destinations).

For my changes, I implemented a lot of # type: ignore for mypy. I'm not familiar with this library and didn't want attempt to re-write the whole object model to suppress a few warnings; please let me know if you have any better suggestions for this.

This commit also addresses #193 I believe and makes #1068 unnecessary (see _reader.py lines 863 - 870 in 670ba91.

The current git history is quite messy for this branch. Before it is eventually merged onto the main, I'll try to squash everything into a single commit (🤞I don't make it even messier - I tried to rebase and instead pulled in the latest commits).

MartinThoma · 2022-07-17T18:56:09Z

The current git history is quite messy for this branch. Before it is eventually merged onto the main, I'll try to squash everything into a single commit

Don't worry about this. I always make squashing commits.

It helps me a lot if the first post of a PR contains what you would like to see as the commit message.

PyPDF2/_merger.py

mtd91429 · 2022-07-17T19:47:13Z

It helps me a lot if the first post of a PR contains what you would like to see as the commit message.

I edited the first comment, above

mtd91429 · 2022-07-22T16:59:33Z

I have rebased the changes onto the latest release (2.7.0) and squashed it all into a single commit. Assuming the review is favorable, I think it should be ready for merge onto the main.

Based on the above conversations, this commit should address issues #193 and #956; it makes PR #1068 unnecessary.

MartinThoma · 2022-07-22T18:00:07Z

Very nice!

I will review it this weekend :-)

MartinThoma · 2022-07-22T18:00:59Z

Just FYI: I'm editing the title of the PR and your first comment to what I will use as the commit message. If you think that something is missing / should be adjusted, feel free to do there :-)

PyPDF2/_reader.py

MartinThoma · 2022-07-23T05:56:04Z

The following gives an error:

from PyPDF2 import PdfReader

# You can also just download the file and use the path:
from io import BytesIO
from tests import get_pdf_from_url
url = "https://corpora.tika.apache.org/base/docs/govdocs1/975/975357.pdf"
name = "tika-975357.pdf"
data = BytesIO(get_pdf_from_url(url, name=name))

reader = PdfReader(data)
reader.outlines

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 685, in outlines
    return self._get_outlines()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 710, in _get_outlines
    self._namedDests = self._get_named_destinations()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 649, in _get_named_destinations
    self._get_named_destinations(kid.get_object(), retval)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 659, in _get_named_destinations
    dest = self._build_destination(key, value)  # type: ignore
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 827, in _build_destination
    page, typ = array[0:2]  # type: ignore
TypeError: 'NullObject' object is not subscriptable

PyPDF2/_reader.py

MartinThoma · 2022-07-23T06:22:44Z

Thank you so much for your contribution 🤗 I read the code, ran a couple of checks, and added minor changes. Now PyPDF2 handles reading outlines way better 🎉

I will make a release to PyPI on Sunday which contains this fix.

New Features (ENH): - Add writer.add_annotation, page.annotations, and generic.AnnotationBuilder (#1120) Bug Fixes (BUG): - Set /AS for /Btn form fields in writer (#1161) - Ignore if /Perms verify failed (#1157) Robustness (ROB): - Cope with utf16 character for space calculation (#1155) - Cope with null params for FitH / FitV destination (#1152) - Handle outlines without valid destination (#1076) Developer Experience (DEV): - Introduce _utils.logger_warning (#1148) Maintenance (MAINT): - Break up parse_to_unicode (#1162) - Add diagnostic output to exception in read_from_stream (#1159) - Reduce PdfReader.read complexity (#1151) Testing (TST): - Add workflow tests found by arc testing (#1154) - Decrypt file which is not encrypted (#1149) - Test CryptRC4 encryption class; test image extraction filters (#1147) Full Changelog: 2.7.0...2.8.0

This comment was marked as outdated.

Sign in to view

MartinThoma reviewed Jul 8, 2022

View reviewed changes

PyPDF2/generic.py Outdated Show resolved Hide resolved

mtd91429 force-pushed the main branch 3 times, most recently from 2e8e19b to 72203eb Compare July 9, 2022 21:09

MartinThoma changed the title ~~add color and text format to returned outline~~ ENH: Add color and text format to returned outline Jul 10, 2022

MartinThoma added the PdfReader The PdfReader component is affected label Jul 10, 2022

mtd91429 force-pushed the main branch from e42059c to fd0a5b8 Compare July 10, 2022 22:27

mtd91429 changed the title ~~ENH: Add color and text format to returned outline~~ ENH: Add color and text format to returned outline, handle outlines without destination Jul 10, 2022

mtd91429 mentioned this pull request Jul 12, 2022

PyPDF2.utils.PdfReadError: Unexpected destination '/__WKANCHOR_2' #193

Closed

MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jul 13, 2022