[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

Snuffleupagus · 2021-11-25T17:53:59Z

This patch basically extends the approach from PR #10392, by also checking the last page.

Currently, in e.g. the Catalog.numPages-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.

This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents.
For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents,
This will affect the disableAutoFetch = true mode negatively, since we now need to fetch/parse more data during document initialization. While the disableAutoFetch = true mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue #14303, namely the poppler-67295-0.pdf and poppler-85140-0.pdf documents.

pdfjsbot · 2021-11-27T11:53:47Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt

pdfjsbot · 2021-11-27T11:53:47Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/b7a4d032047ecc1/output.txt

pdfjsbot · 2021-11-27T12:16:06Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt

Total script time: 22.31 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 6
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/d3cf799f6c3bb84/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-11-27T12:36:06Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/b7a4d032047ecc1/output.txt

Total script time: 42.30 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/b7a4d032047ecc1/reftest-analyzer.html#web=eq.log

timvandermeij

Looks like a reasonable approach; thank you for doing this!

r=me with a rebase to fix the merge conflict

…tialization (issue 14303) *This patch basically extends the approach from PR 10392, by also checking the last page.* Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid. As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser). Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages. To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages. Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug. - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents. - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents, - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost. As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value). Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.

Snuffleupagus · 2021-11-27T22:00:21Z

/botio test

pdfjsbot · 2021-11-27T22:00:23Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1ebd26e80fe8921/output.txt

pdfjsbot · 2021-11-27T22:00:23Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/7ecf177407bee5c/output.txt

pdfjsbot · 2021-11-27T22:22:42Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/7ecf177407bee5c/output.txt

Total script time: 22.28 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 10

Image differences available at: http://54.241.84.105:8877/7ecf177407bee5c/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-11-27T22:42:18Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/1ebd26e80fe8921/output.txt

Total script time: 41.91 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/1ebd26e80fe8921/reftest-analyzer.html#web=eq.log

Snuffleupagus added core corrupted-pdf labels Nov 25, 2021

Snuffleupagus force-pushed the validate-Pages-Count branch 7 times, most recently from e471a75 to 3609937 Compare November 25, 2021 21:46

Snuffleupagus marked this pull request as ready for review November 25, 2021 21:50

Snuffleupagus force-pushed the validate-Pages-Count branch from 3609937 to 4b3e575 Compare November 26, 2021 09:16

mozilla deleted a comment from pdfjsbot Nov 26, 2021

Snuffleupagus mentioned this pull request Nov 26, 2021

Prevent circular references in XRef tables from hanging the worker-thread (issue 14303) #14312

Merged

Snuffleupagus force-pushed the validate-Pages-Count branch 2 times, most recently from 4acc0fb to 1c74f83 Compare November 27, 2021 11:52

mozilla deleted a comment from pdfjsbot Nov 27, 2021

Snuffleupagus force-pushed the validate-Pages-Count branch from 1c74f83 to 717ce12 Compare November 27, 2021 12:01

timvandermeij approved these changes Nov 27, 2021

View reviewed changes

timvandermeij changed the title ~~Validate the /Pages-tree /Count entry during document initialization (issue 14303)~~ [api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) Nov 27, 2021

Snuffleupagus force-pushed the validate-Pages-Count branch from 717ce12 to d0c4bbd Compare November 27, 2021 21:56

Snuffleupagus merged commit a2a5376 into mozilla:master Nov 27, 2021

Snuffleupagus mentioned this pull request Nov 27, 2021

Crashes and timeouts on bug tracker corpus files #14303

Closed

Snuffleupagus deleted the validate-Pages-Count branch November 27, 2021 22:48

This was referenced Nov 29, 2021

Remove the unused skipCount parameter from Catalog.getPageDict (PR 14311 follow-up) #14325

Merged

[Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up) #14335

Merged

Snuffleupagus mentioned this pull request Feb 16, 2022

Get "unnecessary" range on first page #14570

Closed

Snuffleupagus mentioned this pull request Oct 19, 2022

Relax the /Pages dictionary /Count check for corrupt documents (issue 9105) #15593

Merged

conorom mentioned this pull request Mar 22, 2023

Chrome and range requests #9022

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

Snuffleupagus commented Nov 25, 2021 •

edited

Loading

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

timvandermeij left a comment

Snuffleupagus commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

Conversation

Snuffleupagus commented Nov 25, 2021 • edited Loading

pdfjsbot commented Nov 27, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented Nov 27, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented Nov 27, 2021

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Nov 27, 2021

From: Bot.io (Windows)

Failed

timvandermeij left a comment

Choose a reason for hiding this comment

Snuffleupagus commented Nov 27, 2021

pdfjsbot commented Nov 27, 2021

From: Bot.io (Windows)

Received

pdfjsbot commented Nov 27, 2021

From: Bot.io (Linux m4)

Received

pdfjsbot commented Nov 27, 2021

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Nov 27, 2021

From: Bot.io (Windows)

Failed

Snuffleupagus commented Nov 25, 2021 •

edited

Loading