Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311

Merged
merged 1 commit into from
Nov 27, 2021

Conversation

Snuffleupagus
Copy link
Collaborator

@Snuffleupagus Snuffleupagus commented Nov 25, 2021

This patch basically extends the approach from PR #10392, by also checking the last page.

Currently, in e.g. the Catalog.numPages-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.

  • This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents.
  • For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents,
  • This will affect the disableAutoFetch = true mode negatively, since we now need to fetch/parse more data during document initialization. While the disableAutoFetch = true mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue #14303, namely the poppler-67295-0.pdf and poppler-85140-0.pdf documents.

@Snuffleupagus Snuffleupagus force-pushed the validate-Pages-Count branch 7 times, most recently from e471a75 to 3609937 Compare November 25, 2021 21:46
@Snuffleupagus Snuffleupagus marked this pull request as ready for review November 25, 2021 21:50
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 26, 2021
@Snuffleupagus Snuffleupagus force-pushed the validate-Pages-Count branch 2 times, most recently from 4acc0fb to 1c74f83 Compare November 27, 2021 11:52
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/b7a4d032047ecc1/output.txt

@mozilla mozilla deleted a comment from pdfjsbot Nov 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 27, 2021
@mozilla mozilla deleted a comment from pdfjsbot Nov 27, 2021
@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt

Total script time: 22.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 6
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/d3cf799f6c3bb84/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/b7a4d032047ecc1/output.txt

Total script time: 42.30 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/b7a4d032047ecc1/reftest-analyzer.html#web=eq.log

Copy link
Contributor

@timvandermeij timvandermeij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable approach; thank you for doing this!

r=me with a rebase to fix the merge conflict

@timvandermeij timvandermeij changed the title Validate the /Pages-tree /Count entry during document initialization (issue 14303) [api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) Nov 27, 2021
…tialization (issue 14303)

*This patch basically extends the approach from PR 10392, by also checking the last page.*

Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid.
As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).

Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages.

Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
 - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents.
 - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents,
 - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).

Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1ebd26e80fe8921/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/7ecf177407bee5c/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/7ecf177407bee5c/output.txt

Total script time: 22.28 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 10

Image differences available at: http://54.241.84.105:8877/7ecf177407bee5c/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/1ebd26e80fe8921/output.txt

Total script time: 41.91 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/1ebd26e80fe8921/reftest-analyzer.html#web=eq.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants