-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle invalid metadata values #320
Conversation
Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw
Codecov Report
@@ Coverage Diff @@
## develop #320 +/- ##
===========================================
+ Coverage 97.48% 98.17% +0.68%
===========================================
Files 10 10
Lines 1193 1205 +12
===========================================
+ Hits 1163 1183 +20
+ Misses 30 22 -8
Continue to review full report at Codecov.
|
Thanks, @samkit-jain! I think this is a smart approach. Just a few notes / requests:
Thanks again! |
@jsvine I have made the metadata failure condition configurable. Since I am unable to create a PDF that would cause an exception in the new recursive resolver method, I was unable to increase the overall test coverage. Added tests for the draw methods to increase the coverage instead, but, this meant that the test coverage for the files changed in this PR is still below the previous one. That's why the codecov/patch is failing but codecov/project is passing. The changelog has been updated as well. |
Thanks, @samkit-jain! I'm slightly wary of incorporating new code that doesn't have a PDF to test it with, but have merged this since (a) the logic of the code makes sense, (b) the untested part of the code handles what would already be edge-cases, and (c) the tests pass for all the PDFs we are using. |
Code and commits by @samkit-jain: * Treat invalid/unparseable metadata values as warnings — Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw * Recursively parse metadata values to handle nested `PDFObjRef` objects — Fixes #316 * Resolve lint issues and remove unused imports * Make metadata parse failure handling behaviour configurable * Update tests to bump up test coverage * Update changelog Co-authored-by: Samkit Jain <15127115+samkit-jain@users.noreply.github.com>
This PR fixes issue #316
Background: The PDF exhibiting the exception contained a metadata key
Changes
which was a list ofPDFObjRef
objects. The code responsible for resolving metadata expected a list to be of string-like objects and since in this case, it wasn't, thedecode_text()
function raised theTypeError
exception as mentioned in the issue.This metadata value type is not as per the PDF specification (see p556 sec14.3). PDF specifications are well, just specifications. Not conforming to it does not necessarily render a PDF unreadable and a lot of PDFs don't abide by every single rule mentioned in it. Hence, in the first commit, I added a
try-except
block that would log a warning to the console when a metadata value is unparseable and continues to the next one. This would allow the user to perform non-metadata related operations without any issue. I think we can also make this behaviour configurable by adding a new parameter, say,fail_on_metadata
in thePDF.open()
method that when set toTrue
, will raise an exception and whenFalse
, just log to the console.Fix: To resolve the issue, I have added a new method
resolve_and_decode()
which would recursively go through the metadata values and decode them.Additional: Pinned versions of
pytest
and removed unused imports intests/test_issues.py
.