-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311
[api-minor] Validate the /Pages-tree /Count entry during document initialization (issue 14303) #14311
Conversation
e471a75
to
3609937
Compare
3609937
to
4b3e575
Compare
4acc0fb
to
1c74f83
Compare
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/b7a4d032047ecc1/output.txt |
1c74f83
to
717ce12
Compare
From: Bot.io (Linux m4)FailedFull output at http://54.241.84.105:8877/d3cf799f6c3bb84/output.txt Total script time: 22.31 mins
Image differences available at: http://54.241.84.105:8877/d3cf799f6c3bb84/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/b7a4d032047ecc1/output.txt Total script time: 42.30 mins
Image differences available at: http://54.193.163.58:8877/b7a4d032047ecc1/reftest-analyzer.html#web=eq.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a reasonable approach; thank you for doing this!
r=me with a rebase to fix the merge conflict
…tialization (issue 14303) *This patch basically extends the approach from PR 10392, by also checking the last page.* Currently, in e.g. the `Catalog.numPages`-getter, we're simply assuming that if the /Pages-tree has an *integer* /Count entry it must also be correct/valid. As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser). Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the *entire* /Pages-tree and essentially counting the pages. To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the *entire* /Pages-tree to determine the number of pages. Unfortunately these changes will have a number of *somewhat* negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug. - This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the *last* page of the PDF documents. - For poorly generated PDF documents, where the entire /Pages-tree only has *one* level, we'll unfortunately need to fetch/parse the *entire* /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of *some* long PDF documents, - This will affect the `disableAutoFetch = true` mode negatively, since we now need to fetch/parse more data during document initialization. While the `disableAutoFetch = true` mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost. As one *small* additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value). Fixes two of the issues listed in issue 14303, namely the `poppler-67295-0.pdf` and `poppler-85140-0.pdf` documents.
717ce12
to
d0c4bbd
Compare
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/1ebd26e80fe8921/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/7ecf177407bee5c/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.241.84.105:8877/7ecf177407bee5c/output.txt Total script time: 22.28 mins
Image differences available at: http://54.241.84.105:8877/7ecf177407bee5c/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/1ebd26e80fe8921/output.txt Total script time: 41.91 mins
Image differences available at: http://54.193.163.58:8877/1ebd26e80fe8921/reftest-analyzer.html#web=eq.log |
This patch basically extends the approach from PR #10392, by also checking the last page.
Currently, in e.g. the
Catalog.numPages
-getter, we're simply assuming that if the /Pages-tree has an integer /Count entry it must also be correct/valid.As can be seen in the referenced PDF documents, that entry may be completely bogus which causes general parsing to breaking down elsewhere in the worker-thread (and hanging the browser).
Rather than hoping that the /Count entry is correct, similar to all other data found in PDF documents, we obviously need to validate it. This turns out to be a little less straightforward than one would like, since the only way to do this (as far as I know) is to parse the entire /Pages-tree and essentially counting the pages.
To avoid doing that for all documents, this patch tries to take a short-cut by checking if the last page (based on the /Count entry) can be successfully fetched. If so, we assume that the /Count entry is correct and use it as-is, otherwise we'll iterate through (potentially) the entire /Pages-tree to determine the number of pages.
Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.
disableAutoFetch = true
mode negatively, since we now need to fetch/parse more data during document initialization. While thedisableAutoFetch = true
mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.As one small additional bonus, we should now also be able to support opening PDF documents where the /Pages-tree /Count entry is completely invalid (e.g. contains a non-integer value).
Fixes two of the issues listed in issue #14303, namely the
poppler-67295-0.pdf
andpoppler-85140-0.pdf
documents.