ROB: Handle outlines without destination #1076
Failed on CI due to bookmark/outline in this. The outlines for "Tables" and "Figures" have no destination (i.e., |
2e8e19b to 72203eb
Codecov Report
@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   92.13%   92.12%   -0.02%
==========================================
  Files          24       24
  Lines        4756     4772     +16
  Branches      985      989      +4
==========================================
+ Hits         4382     4396     +14
- Misses        226      228      +2
  Partials      148      148
|
Cannot be sure without the file, but believe this commit will address issue noted in #956, |
I can confirm this PR fixes #956. Please @MartinThoma, add this PR to the next release. |
@mtd91429 This PR now contains multiple different things. The color and font format attributes (with the OutlineItem enum) seem pretty straightforward. I think we can merge them very soon. The other parts need more attention as they could break things. What do you think about making a second PR with the simple changes that we can merge directly? |
Seems like a reasonable idea. I've created the changes for an outline's color and font format in a new PR #1104. I'll ultimately remove those congruent changes from this PR branch and focus this PR on handling outlines without destinations |
Currently, test_unexpected_destination() no longer raises an exception. I need to think about this, but my first impulse is that this might actually be the desired behavior. Hence we might need to adjust the test. What do you think? |
I agree. In the latest commit, I have deleted this test. |
I've overhauled the logic of how the PdfReader module handles malformed outlines. Let me try to summarize what I've attempted to accomplish with the code and my reasoning; the implementation review is still necessary. The code base is designed such that
... indicates truncated lines removed for demonstration purposes. With the current commit's changes, running the above
I found this specific instance via CI when trying to implement the color and font format code. I subsequently went down a rabbit hole and found "malformed outlines" in a few more PDFs and felt that PyPDF2's behavior was not optimal. In the method After that, the method attempts to locate the destination associated with the outline object. If a destination is found, the outline is returned with a destination. If a destination is not found, the outline is returned with a Null destination. This is the behavior that @MartinThoma is discussing in that last comment; I don't think it should silently ignore the outline object or throw an error. I found a few examples of "real world" PDFs in the repository's code base for which the unit tests For my changes, I implemented a lot of I believe this commit also addresses #193 and makes #1068 unnecessary (see _reader.py lines 863 - 870 in 670ba91). The current git history is quite messy for this branch. Before it is eventually merged onto the main, I'll try to squash everything into a single commit (🤞 I don't make it even messier; I tried to rebase and instead pulled in the latest commits). |
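The fallback described in the comment above (keep the outline item, but give it an empty destination when none can be resolved) can be sketched in plain Python. This is a hypothetical illustration and not PyPDF2's implementation: `build_outline_item`, `named_destinations`, and the dictionary layout are assumptions made for this example only.

```python
from typing import Any, Dict, Optional


def build_outline_item(
    node: Dict[str, Any],
    named_destinations: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Return an outline item whose destination is None when it cannot be resolved.

    Hypothetical helper: `node` mimics a raw outline dictionary with a
    "/Title" entry and an optional "/Dest" entry.
    """
    dest = node.get("/Dest")
    resolved: Optional[Dict[str, Any]] = None
    if isinstance(dest, dict):
        # Explicit destination stored directly on the outline item.
        resolved = dest
    elif isinstance(dest, str):
        # Named destination: look it up and tolerate an unknown name.
        resolved = named_destinations.get(dest)
    # Neither raise nor drop the item: keep the title, leave the destination empty.
    return {"/Title": node.get("/Title", ""), "destination": resolved}


named = {"/intro": {"/Page": 0, "/Type": "/Fit"}}
with_dest = build_outline_item({"/Title": "Introduction", "/Dest": "/intro"}, named)
without_dest = build_outline_item({"/Title": "Tables"}, named)
unresolved = build_outline_item({"/Title": "Anchor", "/Dest": "/__WKANCHOR_2"}, named)
```

In this sketch, `without_dest` and `unresolved` both keep their titles and carry `None` as the destination, so a caller can still list them without special error handling.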
Don't worry about this. I always make squashing commits. It helps me a lot if the first post of a PR contains what you would like to see as the commit message. |
I edited the first comment, above |
I have rebased the changes onto the latest release (2.7.0) and squashed it all into a single commit. Assuming the review is favorable, I think it should be ready for merge onto the main. Based on the above conversations, this commit should address issues #193 and #956; it makes PR #1068 unnecessary. |
Very nice! I will review it this weekend :-) |
Just FYI: I'm editing the title of the PR and your first comment to what I will use as the commit message. If you think that something is missing / should be adjusted, feel free to do so there :-) |
The following gives an error:

from PyPDF2 import PdfReader

# You can also just download the file and use the path:
from io import BytesIO
from tests import get_pdf_from_url

url = "https://corpora.tika.apache.org/base/docs/govdocs1/975/975357.pdf"
name = "tika-975357.pdf"
data = BytesIO(get_pdf_from_url(url, name=name))
reader = PdfReader(data)
reader.outlines

Traceback:
|
Thank you so much for your contribution 🤗 I read the code, ran a couple of checks, and added minor changes. Now PyPDF2 handles reading outlines way better 🎉 I will make a release to PyPI on Sunday which contains this fix. |
New Features (ENH):
- Add writer.add_annotation, page.annotations, and generic.AnnotationBuilder (#1120)

Bug Fixes (BUG):
- Set /AS for /Btn form fields in writer (#1161)
- Ignore if /Perms verify failed (#1157)

Robustness (ROB):
- Cope with utf16 character for space calculation (#1155)
- Cope with null params for FitH / FitV destination (#1152)
- Handle outlines without valid destination (#1076)

Developer Experience (DEV):
- Introduce _utils.logger_warning (#1148)

Maintenance (MAINT):
- Break up parse_to_unicode (#1162)
- Add diagnostic output to exception in read_from_stream (#1159)
- Reduce PdfReader.read complexity (#1151)

Testing (TST):
- Add workflow tests found by arc testing (#1154)
- Decrypt file which is not encrypted (#1149)
- Test CryptRC4 encryption class; test image extraction filters (#1147)

Full Changelog: 2.7.0...2.8.0
Adjust PdfReader._build_outline(...) and PdfReader._build_destination(...) to handle outline items with and without valid destinations.

Closes #193: PdfReadError: Unexpected destination '/__WKANCHOR_2'
Closes #956: ValueError: Unresolved bookmark
#1059 no longer throws an exception, but the outlines are not extracted either.
Closes #1068: Skip NameObject when building outline
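For code that consumes outlines, a minimal sketch of walking a nested outline while tolerating items without a destination might look like the following. It assumes a nested-list shape in which a sub-list holds the children of the preceding item, and it uses plain dicts as stand-ins for outline items; `walk_outline` and `sample_outline` are hypothetical names, not part of PyPDF2.

```python
from typing import Any, Iterator, List, Optional, Tuple


def walk_outline(
    outline: List[Any], depth: int = 0
) -> Iterator[Tuple[int, str, Optional[int]]]:
    """Yield (depth, title, page) for every item; page is None if it is missing."""
    for entry in outline:
        if isinstance(entry, list):
            # A nested list holds the children of the preceding item.
            yield from walk_outline(entry, depth + 1)
        else:
            # .get() tolerates items that carry no page/destination at all.
            yield depth, entry.get("/Title", ""), entry.get("/Page")


sample_outline = [
    {"/Title": "Introduction", "/Page": 0},
    [{"/Title": "Background", "/Page": 1}],
    {"/Title": "Tables"},  # outline item without any destination
    {"/Title": "Figures", "/Page": None},  # destination present but null
]
rows = list(walk_outline(sample_outline))
```

The point of the design is that "Tables" and "Figures" still appear in `rows`, with `None` in place of a page, instead of being skipped or raising.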