Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Clear all caches in XRef.indexObjects, and improve /Root dictionary validation in XRef.parse (issue 14303) #14338

Merged

Conversation

Snuffleupagus
Copy link
Collaborator

This patch improves handling of a couple of PDF documents from issue #14303.

  • Update XRef.indexObjects to actually clear all XRef-caches. Invalid XRef tables usually cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well.

  • Improve the /Root dictionary validation in XRef.parse (PR Fix various corrupt PDF files (issue 9252, issue 9418) #9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched and that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. Catalog.toplevelPagesDict will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents.

  • Throw an InvalidPDFException, rather than a general FormatError, in XRef.parse when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)

With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular GHOSTSCRIPT-698804-1-fuzzed.pdf will now both load and "render" correctly, whereas poppler-395-0-fuzzed.pdf will now fail immediately upon loading (rather than appearing to work).

…t dictionary validation in `XRef.parse` (issue 14303)

*This patch improves handling of a couple of PDF documents from issue 14303.*

 - Update `XRef.indexObjects` to actually clear *all* XRef-caches. Invalid XRef tables *usually* cause issues early enough during parsing that we've not populated the XRef-cache, however to prevent any issues we obviously need to clear that one as well.

 - Improve the /Root dictionary validation in `XRef.parse` (PR 9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched *and* that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fallback to indexing the objects in corrupt documents.

 - Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)

With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than *appearing* to work).
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Dec 3, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/46932625e377ffb/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Dec 3, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/12bada7e2c6d489/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Dec 3, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/46932625e377ffb/output.txt

Total script time: 21.03 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 7
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/46932625e377ffb/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Dec 3, 2021

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/12bada7e2c6d489/output.txt

Total script time: 42.10 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/12bada7e2c6d489/reftest-analyzer.html#web=eq.log

@timvandermeij timvandermeij merged commit 335c4c8 into mozilla:master Dec 4, 2021
@timvandermeij
Copy link
Contributor

LGTM. Thank you for improving this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants