Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROB: Handle outlines without destination #1076

Merged
merged 4 commits into from
Jul 23, 2022
Merged

Conversation

mtd91429
Copy link
Contributor

@mtd91429 mtd91429 commented Jul 8, 2022

Adjust PdfReader._build_outline(...) and PdfReader._build_destination(...) to handle outline items with and without valid destinations

Closes #193 : PdfReadError: Unexpected destination '/__WKANCHOR_2'
Closes #956 : ValueError: Unresolved bookmark

#1059 no longer throws an exception, but the outlines are not extracted either.

Closes #1068 : Skip NameObject when building outline

@mtd91429

This comment was marked as outdated.

@mtd91429
Copy link
Contributor Author

mtd91429 commented Jul 8, 2022

Failed on CI due to bookmark/outline in this. The outlines for "Tables" and "Figures" have no destination (i.e., /A is ). Instead, they simply serve as a placeholder to help organize the children headings. Although it is an odd type of bookmark/outline, I personally don't think they should fail to return these bookmarks as it produces unexpected results. Typical PDF reading software show these outlines. Will work on a fix for the issue.

PyPDF2/generic.py Outdated Show resolved Hide resolved
@mtd91429 mtd91429 force-pushed the main branch 3 times, most recently from 2e8e19b to 72203eb Compare July 9, 2022 21:09
@MartinThoma MartinThoma changed the title add color and text format to returned outline ENH: Add color and text format to returned outline Jul 10, 2022
@MartinThoma MartinThoma added the PdfReader The PdfReader component is affected label Jul 10, 2022
@mtd91429 mtd91429 changed the title ENH: Add color and text format to returned outline ENH: Add color and text format to returned outline, handle outlines without destination Jul 10, 2022
@codecov
Copy link

codecov bot commented Jul 10, 2022

Codecov Report

Merging #1076 (f0abf9d) into main (f233c1a) will decrease coverage by 0.01%.
The diff coverage is 84.37%.

@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   92.13%   92.12%   -0.02%     
==========================================
  Files          24       24              
  Lines        4756     4772      +16     
  Branches      985      989       +4     
==========================================
+ Hits         4382     4396      +14     
- Misses        226      228       +2     
  Partials      148      148              
Impacted Files Coverage Δ
PyPDF2/_merger.py 95.00% <ø> (+2.46%) ⬆️
PyPDF2/_reader.py 90.18% <84.37%> (-1.01%) ⬇️
PyPDF2/_writer.py 89.32% <0.00%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f233c1a...f0abf9d. Read the comment docs.

@mtd91429
Copy link
Contributor Author

Cannot be sure without the file, but believe this commit will address issue noted in #956,

@xilopaint
Copy link
Contributor

Cannot be sure without the file, but believe this commit will address issue noted in #956

I can confirm this PR fixes #956. Please @MartinThoma, add this PR to the next release.

@MartinThoma MartinThoma added the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jul 13, 2022
PyPDF2/_reader.py Outdated Show resolved Hide resolved
PyPDF2/_reader.py Outdated Show resolved Hide resolved
PyPDF2/generic.py Outdated Show resolved Hide resolved
PyPDF2/generic.py Outdated Show resolved Hide resolved
PyPDF2/generic.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

@mtd91429 This PR now contains multiple different things. The color and font format attributes (with the OutlineItem enum) seem pretty straight forward. I think we can merge them very soon. The other parts need more attention as they could break things.

What do you think about making a second PR with the simple changes that we can merge directly?

@mtd91429
Copy link
Contributor Author

What do you think about making a second PR with the simple changes that we can merge directly?

Seems like a reasonable idea. I've created the changes for an outline's color and font format in a new PR #1104. I'll ultimately remove those congruent changes from this PR branch and focus this PR on handling outlines without destinations

@mtd91429 mtd91429 marked this pull request as draft July 13, 2022 21:31
@mtd91429 mtd91429 changed the title ENH: Add color and text format to returned outline, handle outlines without destination ENH: Handle outlines without destination Jul 13, 2022
@MartinThoma
Copy link
Member

Currently, the test_unexpected_destination() does no longer raise an excepting. I need to think about this, but my first impulse is that this might actually be the desired behavior. Hence we might need to adjust the test.

What do you think?

@mtd91429 mtd91429 marked this pull request as ready for review July 15, 2022 17:42
@mtd91429
Copy link
Contributor Author

Currently, the test_unexpected_destination() does no longer raise an excepting. I need to think about this, but my first impulse is that this might actually be the desired behavior. Hence we might need to adjust the test.

What do you think?

I agree. In the latest commit, I have deleted this test.

@mtd91429
Copy link
Contributor Author

I've overhauled the logic of how the PdfReader module handles malformed outlines. Let me try to summarize what I've attempted to accomplish with the code and my reasoning; the implementation review is still necessary.

The code base is designed such that outline objects are a subtype of a destination object. While that is the intuitive purpose of an outline object, it is not required by the PDF specification. By this design, outlines lacking a valid destination were ignored. As an example, open the PDF document tika-924546.pdf in Adobe Acrobat (or another commercial PDF viewer) and look at the outline. You'll notice that there are entries entitled "Tables" and "Figures". With the current repository (lacking this comit's changes), run the following code and examine the output.

def show_tree(outlines, indent=0):
    for item in outlines:
        if isinstance(item, list):
            show_tree(item, indent+4)
        else:
            print(f'{" "*indent}{item.title}')

show_tree(PdfReader("tika-924546.pdf").outlines)


>>>...
>>>    Revenue Projections
>>>    Spending Projections
>>>Glossary
>>>    Summary 1. CBO's Baseline Budget Outlook
>>>    Summary 2. CBO’s Economic Projections for Calendar Years 2006 to 2016
>>>    1-1. Projected Deficits and Surpluses in CBO’s Baseline
>>>...
>>>    F-12. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005
>>>    F-13. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005 
>>>    Summary 1. Spending on Social Security, Medicare, and Medicaid, 1990 to 2016
>>>    Summary 2. Total Revenues and Outlays as a Percentage of Gross Domestic Product, 1965 to 2016
>>>    1-1. The Total Deficit or Surplus as a Percentage of GDP, 1965 to 2016 
>>>...

... indicated truncated lines removed for demonstration purposes.

With the current commit's changes, running the above show_tree() snippet produces the following output:

>>>...
>>>       Revenue Projections
>>>    Spending Projections
>>>Glossary
>>>Tables
>>>    Summary 1. CBO's Baseline Budget Outlook
>>>    Summary 2. CBO’s Economic Projections for Calendar Years 2006 to 2016
>>>    1-1. Projected Deficits and Surpluses in CBO’s Baseline
>>>...
>>>    F-12. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005
>>>    F-13. Standardized-Budget Deficit or Surplus and Related Series, 1962 to 2005 
>>>Figures
>>>    Summary 1. Spending on Social Security, Medicare, and Medicaid, 1990 to 2016
>>>    Summary 2. Total Revenues and Outlays as a Percentage of Gross Domestic Product, 1965 to 2016
>>>    1-1. The Total Deficit or Surplus as a Percentage of GDP, 1965 to 2016 
>>>...

I found this specific instance via CI when trying to implement the color and font format code. I subsequently went down a rabbit hole and found "malformed outlines" in a few more PDFs and felt that PyPDF2's behavior was not optimal.

In the method _build_outline(), the only "required" feature of an outline that is added at this step is the "Title". Although the "/Title" attribute is required by PDF specification 1.7, commercial viewers handle outlines with missing title's without causing an error. For example, open the new PDF located in the resource folder of this commit entitle "outline-without-title.pdf" in Adobe Acrobat. For that reason, I implemented an optional error message to be raised if in "strict" mode. For this, created the unit test test_outline_missing_title().

After that, the method attempts to locate the destination associated with the outline object. If a destination is found, the outline is returned with a destination. If a destination is not found, the outline returns with a Null destination. This is the behavior that @MartinThoma is discussing in that last comment; I don't think it should silently ignore the outline object or throw an error.

I found a few examples of "real world" PDFs in the repository's code base for which the unit tests test_outline_with_missing_named_destination() and test_outline_with_empty_action() were written to handle. Then, I tried to be as comprehensive as possible for all malformed destinations and created the document "outlines-with-invalid-destinations.pdf" from the sample file "pdflatex-outline.pdf". To create that document, I first decompressed the original file (so that I could manipulate the objects in a text editor). I then altered all of the outline objects individually to produce valid or invalid destinations. My manipulations broke the xref table, so I had to re-save the document (which re-compressed it). Unfortunately, I had to use non-PyPDF2 software to accomplish these tasks (ultimately, I'd like to see if I can implement them with the current repository or, if not, expand the code base and/or documentation in a way to accomplish the same goals). The unit test test_outlines_with_invalid_destinations() uses this document and returns all outlines (with valid and invalid destinations).

For my changes, I implemented a lot of # type: ignore for mypy. I'm not familiar with this library and didn't want attempt to re-write the whole object model to suppress a few warnings; please let me know if you have any better suggestions for this.

This commit also addresses #193 I believe and makes #1068 unnecessary (see _reader.py lines 863 - 870 in 670ba91.

The current git history is quite messy for this branch. Before it is eventually merged onto the main, I'll try to squash everything into a single commit (🤞I don't make it even messier - I tried to rebase and instead pulled in the latest commits).

@MartinThoma
Copy link
Member

The current git history is quite messy for this branch. Before it is eventually merged onto the main, I'll try to squash everything into a single commit

Don't worry about this. I always make squashing commits.

It helps me a lot if the first post of a PR contains what you would like to see as the commit message.

PyPDF2/_merger.py Outdated Show resolved Hide resolved
@mtd91429
Copy link
Contributor Author

It helps me a lot if the first post of a PR contains what you would like to see as the commit message.

I edited the first comment, above

@mtd91429
Copy link
Contributor Author

I have rebased the changes onto the latest release (2.7.0) and squashed it all into a single commit. Assuming the review is favorable, I think it should be ready for merge onto the main.

Based on the above conversations, this commit should address issues #193 and #956; it makes PR #1068 unnecessary.

@MartinThoma MartinThoma changed the title ENH: Handle outlines without destination ROB: Handle outlines without destination Jul 22, 2022
@MartinThoma MartinThoma added the is-robustness-issue From a users perspective, this is about robustness label Jul 22, 2022
@MartinThoma
Copy link
Member

Very nice!

I will review it this weekend :-)

@MartinThoma
Copy link
Member

Just FYI: I'm editing the title of the PR and your first comment to what I will use as the commit message. If you think that something is missing / should be adjusted, feel free to do there :-)

PyPDF2/_reader.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

The following gives an error:

from PyPDF2 import PdfReader

# You can also just download the file and use the path:
from io import BytesIO
from tests import get_pdf_from_url
url = "https://corpora.tika.apache.org/base/docs/govdocs1/975/975357.pdf"
name = "tika-975357.pdf"
data = BytesIO(get_pdf_from_url(url, name=name))

reader = PdfReader(data)
reader.outlines

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 685, in outlines
    return self._get_outlines()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 710, in _get_outlines
    self._namedDests = self._get_named_destinations()
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 649, in _get_named_destinations
    self._get_named_destinations(kid.get_object(), retval)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 659, in _get_named_destinations
    dest = self._build_destination(key, value)  # type: ignore
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 827, in _build_destination
    page, typ = array[0:2]  # type: ignore
TypeError: 'NullObject' object is not subscriptable

PyPDF2/_reader.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma merged commit 89c0ff2 into py-pdf:main Jul 23, 2022
@MartinThoma
Copy link
Member

Thank you so much for your contribution 🤗 I read the code, ran a couple of checks, and added minor changes. Now PyPDF2 handles reading outlines way better 🎉

I will make a release to PyPI on Sunday which contains this fix.

MartinThoma added a commit that referenced this pull request Jul 24, 2022
New Features (ENH):
-  Add writer.add_annotation, page.annotations, and generic.AnnotationBuilder (#1120)

Bug Fixes (BUG):
-  Set /AS for /Btn form fields in writer (#1161)
-  Ignore if /Perms verify failed (#1157)

Robustness (ROB):
-  Cope with utf16 character for space calculation (#1155)
-  Cope with null params for FitH / FitV destination (#1152)
-  Handle outlines without valid destination (#1076)

Developer Experience (DEV):
-  Introduce _utils.logger_warning (#1148)

Maintenance (MAINT):
-  Break up parse_to_unicode (#1162)
-  Add diagnostic output to exception in read_from_stream (#1159)
-  Reduce PdfReader.read complexity (#1151)

Testing (TST):
-  Add workflow tests found by arc testing (#1154)
-  Decrypt file which is not encrypted (#1149)
-  Test CryptRC4 encryption class; test image extraction filters (#1147)

Full Changelog: 2.7.0...2.8.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is-robustness-issue From a users perspective, this is about robustness PdfReader The PdfReader component is affected soon PRs that are almost ready to be merged, issues that get solved pretty soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError: Unresolved bookmark PyPDF2.utils.PdfReadError: Unexpected destination '/__WKANCHOR_2'
3 participants