Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle invalid metadata values #320

Merged
merged 6 commits into from
Dec 9, 2020
Merged

Handle invalid metadata values #320

merged 6 commits into from
Dec 9, 2020

Conversation

samkit-jain
Copy link
Collaborator

This PR fixes issue #316

Background: The PDF exhibiting the exception contained a metadata key Changes which was a list of PDFObjRef objects. The code responsible for resolving metadata expected a list to be of string-like objects and since in this case, it wasn't, the decode_text() function raised the TypeError exception as mentioned in the issue.

This metadata value type is not as per the PDF specification (see p556 sec14.3). PDF specifications are well, just specifications. Not conforming to it does not necessarily render a PDF unreadable and a lot of PDFs don't abide by every single rule mentioned in it. Hence, in the first commit, I added a try-except block that would log a warning to the console when a metadata value is unparseable and continues to the next one. This would allow the user to perform non-metadata related operations without any issue. I think we can also make this behaviour configurable by adding a new parameter, say, fail_on_metadata in the PDF.open() method that when set to True, will raise an exception and when False, just log to the console.

Fix: To resolve the issue, I have added a new method resolve_and_decode() which would recursively go through the metadata values and decode them.

Additional: Pinned versions of pytest and removed unused imports in tests/test_issues.py.

@samkit-jain samkit-jain requested a review from jsvine November 29, 2020 12:01
@codecov
Copy link

codecov bot commented Nov 29, 2020

Codecov Report

Merging #320 (b5239ad) into develop (d3b84da) will increase coverage by 0.68%.
The diff coverage is 83.33%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #320      +/-   ##
===========================================
+ Coverage    97.48%   98.17%   +0.68%     
===========================================
  Files           10       10              
  Lines         1193     1205      +12     
===========================================
+ Hits          1163     1183      +20     
+ Misses          30       22       -8     
Impacted Files Coverage Δ
pdfplumber/pdf.py 95.29% <60.00%> (-4.71%) ⬇️
pdfplumber/utils.py 100.00% <100.00%> (ø)
pdfplumber/display.py 88.81% <0.00%> (+7.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3b84da...e849cbb. Read the comment docs.

@jsvine
Copy link
Owner

jsvine commented Dec 3, 2020

Thanks, @samkit-jain! I think this is a smart approach. Just a few notes / requests:

  • I think we can also make this behaviour configurable by adding a new parameter, say, fail_on_metadata in the PDF.open() method that when set to True, will raise an exception and when False, just log to the console.

    • I like that. Want to try implementing on this PR? Perhaps the parameter, however, should be strict_metadata? Open to other ideas.
  • Would you be able to add a test that brings the coverage for pdf.py back up to 100%?

  • Can you add the related entries to the CHANGELOG.md?

Thanks again!

@samkit-jain
Copy link
Collaborator Author

@jsvine I have made the metadata failure condition configurable. Since I am unable to create a PDF that would cause an exception in the new recursive resolver method, I was unable to increase the overall test coverage. Added tests for the draw methods to increase the coverage instead, but, this meant that the test coverage for the files changed in this PR is still below the previous one. That's why the codecov/patch is failing but codecov/project is passing. The changelog has been updated as well.

@jsvine jsvine merged commit 0447ce5 into develop Dec 9, 2020
@samkit-jain samkit-jain deleted the issue-316 branch December 9, 2020 13:57
@jsvine
Copy link
Owner

jsvine commented Dec 9, 2020

Thanks, @samkit-jain! I'm slightly wary of incorporating new code that doesn't have a PDF to test it with, but have merged this since (a) the logic of the code makes sense, (b) the untested part of the code handles what would already be edge-cases, and (c) the tests pass for all the PDFs we are using.

jsvine added a commit that referenced this pull request Dec 9, 2020
Code and commits by @samkit-jain:

* Treat invalid/unparseable metadata values as warnings — Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw

* Recursively parse metadata values to handle nested `PDFObjRef` objects — Fixes #316

* Resolve lint issues and remove unused imports

* Make metadata parse failure handling behaviour configurable

* Update tests to bump up test coverage

* Update changelog

Co-authored-by: Samkit Jain <15127115+samkit-jain@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants