Handle invalid metadata values #320

samkit-jain · 2020-11-29T12:01:43Z

This PR fixes issue #316

Background: The PDF exhibiting the exception contained a metadata key Changes which was a list of PDFObjRef objects. The code responsible for resolving metadata expected a list to be of string-like objects and since in this case, it wasn't, the decode_text() function raised the TypeError exception as mentioned in the issue.

This metadata value type is not as per the PDF specification (see p556 sec14.3). PDF specifications are well, just specifications. Not conforming to it does not necessarily render a PDF unreadable and a lot of PDFs don't abide by every single rule mentioned in it. Hence, in the first commit, I added a try-except block that would log a warning to the console when a metadata value is unparseable and continues to the next one. This would allow the user to perform non-metadata related operations without any issue. I think we can also make this behaviour configurable by adding a new parameter, say, fail_on_metadata in the PDF.open() method that when set to True, will raise an exception and when False, just log to the console.

Fix: To resolve the issue, I have added a new method resolve_and_decode() which would recursively go through the metadata values and decode them.

Additional: Pinned versions of pytest and removed unused imports in tests/test_issues.py.

Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw

Fixes #316

codecov · 2020-11-29T12:03:59Z

Codecov Report

Merging #320 (b5239ad) into develop (d3b84da) will increase coverage by 0.68%.
The diff coverage is 83.33%.

@@             Coverage Diff             @@
##           develop     #320      +/-   ##
===========================================
+ Coverage    97.48%   98.17%   +0.68%     
===========================================
  Files           10       10              
  Lines         1193     1205      +12     
===========================================
+ Hits          1163     1183      +20     
+ Misses          30       22       -8

Impacted Files	Coverage Δ
pdfplumber/pdf.py	`95.29% <60.00%> (-4.71%)`	⬇️
pdfplumber/utils.py	`100.00% <100.00%> (ø)`
pdfplumber/display.py	`88.81% <0.00%> (+7.45%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3b84da...e849cbb. Read the comment docs.

jsvine · 2020-12-03T04:08:58Z

Thanks, @samkit-jain! I think this is a smart approach. Just a few notes / requests:

I think we can also make this behaviour configurable by adding a new parameter, say, fail_on_metadata in the PDF.open() method that when set to True, will raise an exception and when False, just log to the console.
- I like that. Want to try implementing on this PR? Perhaps the parameter, however, should be strict_metadata? Open to other ideas.
Would you be able to add a test that brings the coverage for pdf.py back up to 100%?
Can you add the related entries to the CHANGELOG.md?

Thanks again!

samkit-jain · 2020-12-08T18:10:38Z

@jsvine I have made the metadata failure condition configurable. Since I am unable to create a PDF that would cause an exception in the new recursive resolver method, I was unable to increase the overall test coverage. Added tests for the draw methods to increase the coverage instead, but, this meant that the test coverage for the files changed in this PR is still below the previous one. That's why the codecov/patch is failing but codecov/project is passing. The changelog has been updated as well.

jsvine · 2020-12-09T13:57:41Z

Thanks, @samkit-jain! I'm slightly wary of incorporating new code that doesn't have a PDF to test it with, but have merged this since (a) the logic of the code makes sense, (b) the untested part of the code handles what would already be edge-cases, and (c) the tests pass for all the PDFs we are using.

@samkit-jain

Code and commits by @samkit-jain: * Treat invalid/unparseable metadata values as warnings — Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw * Recursively parse metadata values to handle nested `PDFObjRef` objects — Fixes #316 * Resolve lint issues and remove unused imports * Make metadata parse failure handling behaviour configurable * Update tests to bump up test coverage * Update changelog Co-authored-by: Samkit Jain <15127115+samkit-jain@users.noreply.github.com>

samkit-jain added 3 commits November 26, 2020 21:58

Treat invalid/unparseable metadata values as warnings

9f0a1ad

Certain invalid values if parseable don't throw a warning and only unparseable (always invalid) throw

Recursively parse metadata values to handle nested PDFObjRef objects

2d9415c

Fixes #316

Resolve lint issues and remove unused imports

4ccf8cf

samkit-jain requested a review from jsvine November 29, 2020 12:01

samkit-jain added 3 commits December 8, 2020 22:38

Make metadata parse failure handling behaviour configurable

d352280

Update tests to bump up test coverage

02656b2

Update changelog

e849cbb

jsvine merged commit 0447ce5 into develop Dec 9, 2020

samkit-jain deleted the issue-316 branch December 9, 2020 13:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle invalid metadata values #320

Handle invalid metadata values #320

samkit-jain commented Nov 29, 2020

codecov bot commented Nov 29, 2020 •

edited

Loading

jsvine commented Dec 3, 2020

samkit-jain commented Dec 8, 2020

jsvine commented Dec 9, 2020

Handle invalid metadata values #320

Handle invalid metadata values #320

Conversation

samkit-jain commented Nov 29, 2020

codecov bot commented Nov 29, 2020 • edited Loading

Codecov Report

jsvine commented Dec 3, 2020

samkit-jain commented Dec 8, 2020

jsvine commented Dec 9, 2020

codecov bot commented Nov 29, 2020 •

edited

Loading