[api-minor] Clear all caches in `XRef.indexObjects`, and improve /Root dictionary validation in `XRef.parse` (issue 14303) #14338

Snuffleupagus · 2021-12-03T11:02:52Z

This patch improves handling of a couple of PDF documents from issue #14303.

- Update `XRef.indexObjects` to actually clear *all* XRef-caches. Invalid XRef tables *usually* cause issues early enough during parsing that we've not populated the XRef-cache; however, to prevent any issues we obviously need to clear that one as well.
- Improve the /Root dictionary validation in `XRef.parse` (PR #9827 follow-up). In addition to checking that a /Pages entry exists, we'll now also check that it can be successfully fetched *and* that it's of the correct type. There's really no point trying to use a /Root dictionary that e.g. `Catalog.toplevelPagesDict` will reject, and this way we'll be able to fall back to indexing the objects in corrupt documents.
- Throw an `InvalidPDFException`, rather than a general `FormatError`, in `XRef.parse` when no usable /Root dictionary could be found. That really seems more appropriate overall, since all attempts at parsing/recovery have failed. (This part of the patch is API-observable, hence the tag.)

With these changes, two existing test-cases are improved and the unit-tests are updated/re-factored to highlight that. In particular `GHOSTSCRIPT-698804-1-fuzzed.pdf` will now both load and "render" correctly, whereas `poppler-395-0-fuzzed.pdf` will now fail immediately upon loading (rather than *appearing* to work).
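
To make the control flow described above easier to follow, here is a minimal, self-contained sketch of it. It is not the actual PDF.js code: `ToyXRef`, its `objects` map, `trailerRootNum`, `isUsableRoot`, and the local `InvalidPDFException` class are all hypothetical stand-ins, and a plain JS object plays the role of a PDF dictionary. It only illustrates the idea of validating the /Pages entry of a /Root candidate, clearing the cache before re-indexing, and throwing a dedicated exception once recovery has failed.

```javascript
// Hypothetical stand-in for the exception type; NOT the real PDF.js class.
class InvalidPDFException extends Error {
  constructor(msg) {
    super(msg);
    this.name = "InvalidPDFException";
  }
}

// Toy model of an XRef: `objects` maps object numbers to plain JS values,
// `trailerRootNum` is what the (possibly corrupt) table claims is /Root.
class ToyXRef {
  constructor(objects, trailerRootNum) {
    this.objects = objects;
    this.trailerRootNum = trailerRootNum;
    this.cache = new Map(); // cache of already fetched objects
  }

  fetch(num) {
    if (!this.cache.has(num)) {
      if (!this.objects.has(num)) {
        throw new Error(`Missing object ${num}`);
      }
      this.cache.set(num, this.objects.get(num));
    }
    return this.cache.get(num);
  }

  // A /Root candidate is usable only if its /Pages entry exists, can be
  // fetched, and resolves to a dictionary-like value.
  isUsableRoot(num) {
    try {
      const root = this.fetch(num);
      if (!root || typeof root !== "object" || !("Pages" in root)) {
        return false;
      }
      const pages = this.fetch(root.Pages);
      return pages !== null && typeof pages === "object" && !Array.isArray(pages);
    } catch {
      return false;
    }
  }

  // Recovery path: clear everything cached so far, then scan all objects
  // for a catalog that passes the same validation.
  indexObjects() {
    this.cache.clear();
    for (const [num, obj] of this.objects) {
      if (obj && obj.Type === "Catalog" && this.isUsableRoot(num)) {
        return num;
      }
    }
    return null;
  }

  parse() {
    if (this.isUsableRoot(this.trailerRootNum)) {
      return this.trailerRootNum;
    }
    const recovered = this.indexObjects();
    if (recovered === null) {
      // All attempts at parsing/recovery have failed.
      throw new InvalidPDFException("No usable /Root dictionary found.");
    }
    return recovered;
  }
}

// The trailer points at object 1, whose /Pages (object 7) does not exist;
// object 2 is a usable catalog, so recovery succeeds.
const recoverable = new ToyXRef(
  new Map([
    [1, { Type: "Catalog", Pages: 7 }],
    [2, { Type: "Catalog", Pages: 3 }],
    [3, { Type: "Pages", Kids: [] }],
  ]),
  1
);
console.log(recoverable.parse()); // 2

// No usable catalog anywhere, so loading fails immediately.
const hopeless = new ToyXRef(new Map([[1, { Type: "Catalog", Pages: 7 }]]), 1);
try {
  hopeless.parse();
} catch (e) {
  console.log(e.name); // InvalidPDFException
}
```

The two toy inputs loosely mirror the two outcomes mentioned above: one corrupt document that becomes usable after re-indexing, and one that is rejected up front instead of appearing to work.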

Snuffleupagus · 2021-12-03T11:11:10Z

/botio test

pdfjsbot · 2021-12-03T11:11:11Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/46932625e377ffb/output.txt

pdfjsbot · 2021-12-03T11:11:12Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/12bada7e2c6d489/output.txt

pdfjsbot · 2021-12-03T11:32:14Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/46932625e377ffb/output.txt

Total script time: 21.03 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 7
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/46932625e377ffb/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-12-03T11:53:19Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/12bada7e2c6d489/output.txt

Total script time: 42.10 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/12bada7e2c6d489/reftest-analyzer.html#web=eq.log

timvandermeij · 2021-12-04T12:23:55Z

LGTM. Thank you for improving this!

Snuffleupagus added core corrupted-pdf labels Dec 3, 2021

timvandermeij approved these changes Dec 4, 2021

timvandermeij merged commit 335c4c8 into mozilla:master Dec 4, 2021

Snuffleupagus deleted the XRef-more-Pages-validation branch December 4, 2021 13:33

Snuffleupagus mentioned this pull request Dec 13, 2021

Slightly reduce asynchronicity in the Catalog.getPageDict method (PR 14338 follow-up) #14370

Merged
