[api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303) #14338
This patch improves handling of a couple of PDF documents from issue #14303.
 - Update `XRef.indexObjects` to actually clear all XRef-caches. Invalid XRef tables usually cause issues early enough during parsing that we've not populated the XRef-cache; however, to prevent any issues we obviously need to clear that one as well.

 - Improve the /Root dictionary validation in `XRef.parse` (follow-up to PR #9827, "Fix various corrupt PDF files (issue 9252, issue 9418)"). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched and that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fall back to indexing the objects in corrupt documents.

 - Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)

With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than appearing to work).
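
For readers who want the gist without opening the diff, the three changes can be sketched roughly as follows. This is a simplified illustration, not the code in the patch: `ToyXRef`, `resetForIndexing`, `isDict`, `isUsableRoot`, `pickRoot`, and the error message are hypothetical names, the `InvalidPDFException` class here is a local stand-in, and dictionaries are modelled as plain JavaScript objects.

```javascript
// Local stand-in for the exception type named in the description (assumption).
class InvalidPDFException extends Error {
  constructor(message) {
    super(message);
    this.name = "InvalidPDFException";
  }
}

// Toy cross-reference store (hypothetical): `objects` maps object numbers to
// values, and dictionaries are modelled as plain JavaScript objects.
class ToyXRef {
  constructor(objects) {
    this.objects = objects;
    this.entries = [];
    this.cache = new Map();
  }

  fetch(num) {
    if (this.cache.has(num)) {
      return this.cache.get(num);
    }
    if (!this.objects.has(num)) {
      throw new Error(`Object ${num} is missing.`);
    }
    const value = this.objects.get(num);
    this.cache.set(num, value);
    return value;
  }

  // Before re-indexing, drop every piece of state derived from the bad table,
  // including the fetch cache, even though it is usually still empty.
  resetForIndexing() {
    this.entries.length = 0;
    this.cache.clear();
  }
}

function isDict(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// A /Root candidate is usable only if its /Pages entry exists, can be
// fetched, and resolves to a dictionary.
function isUsableRoot(xref, root) {
  if (!isDict(root) || !("Pages" in root)) {
    return false;
  }
  try {
    return isDict(xref.fetch(root.Pages));
  } catch {
    return false; // The /Pages reference could not be resolved.
  }
}

// Return the first usable candidate; if none qualifies, report the document
// as invalid rather than raising a generic error.
function pickRoot(xref, candidates) {
  for (const root of candidates) {
    if (isUsableRoot(xref, root)) {
      return root;
    }
  }
  throw new InvalidPDFException("No usable /Root dictionary found.");
}
```

In this sketch a caller would try `pickRoot` first, call `resetForIndexing` before rebuilding the table from a scan of the file, and treat an `InvalidPDFException` as the signal that recovery has also failed.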