
Stop caching Streams in XRef.fetchCompressed #11370

Merged
merged 2 commits into from
Nov 30, 2019

Conversation

Snuffleupagus
Collaborator

I'm slightly surprised that this hasn't actually caused any (known) bugs, but that may be more luck than anything else, since it fortunately doesn't seem common for Streams to be defined inside of an 'ObjStm'.[1]

Note that in the `XRef.fetchUncompressed` method we're *not* caching Streams, and for very good reasons too:

 - Streams, especially the `DecodeStream` ones, can become *very* large once read. Hence caching them really isn't a good idea, simply because of the (potential) memory impact of doing so.

 - Attempting to read from the *same* Stream more than once won't work, unless it's `reset` in between, since any method such as `getBytes` always starts reading at the current data position.

 - Given that even the `src/core/` code is now fairly asynchronous (see e.g. the `PartialEvaluator`), it's generally impossible to assert that any one Stream isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream isn't going to work in the general case.

All in all, I cannot understand why it'd ever be correct to cache Streams in the `XRef.fetchCompressed` method.

---
[1] One example where that happens is the `issue3115r.pdf` file in the test-suite, where the streams in question are not actually used for anything within the PDF.js code.
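The second point above — that a Stream can only be read once unless it is `reset` in between — can be illustrated with a minimal sketch. `SimpleStream` here is a hypothetical stand-in loosely modeled on the behaviour being described, not PDF.js's actual `Stream` class:

```javascript
// Hypothetical minimal stream; reading always starts at the current
// data position and advances it, just like the behaviour described above.
class SimpleStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.start = 0;
    this.pos = 0;
  }
  // Returns everything from the *current* position to the end,
  // leaving the position at the end of the data.
  getBytes() {
    const result = this.bytes.subarray(this.pos);
    this.pos = this.bytes.length;
    return result;
  }
  reset() {
    this.pos = this.start;
  }
}

const stream = new SimpleStream(new Uint8Array([1, 2, 3]));
console.log(stream.getBytes().length); // 3 — first read gets all the data
console.log(stream.getBytes().length); // 0 — second read starts at the end
stream.reset();
console.log(stream.getBytes().length); // 3 — works again only after reset()
```

This is why handing the *same* cached Stream instance to two independent consumers is unsafe: whichever reads first leaves the position at the end for the other.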

 - Change all occurrences of `var` to `let`/`const`.

 - Initialize the (temporary) Arrays with the correct sizes upfront.

 - Inline the `isCmd` check. Obviously this won't make a huge difference, but given that the check is only relevant for corrupt documents it cannot hurt.
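The guard being discussed — fetch the object, but decline to cache it when it is a Stream — can be sketched roughly as follows. The `isStream` predicate and `fetchAndMaybeCache` helper are simplified illustrations, not the actual PDF.js helpers or signatures:

```javascript
// Simplified stand-in for a "is this a Stream?" check; PDF.js has its
// own helper for this, which this sketch does not reproduce.
function isStream(obj) {
  return (
    obj !== null &&
    typeof obj === "object" &&
    typeof obj.getBytes === "function"
  );
}

// Cache plain (immutable) objects only; Streams are skipped because
// reading them mutates their position and they can be very large.
function fetchAndMaybeCache(cacheMap, num, obj) {
  if (!isStream(obj)) {
    cacheMap.set(num, obj);
  }
  return obj;
}

const cache = new Map();
fetchAndMaybeCache(cache, 1, { name: "Dict" });
fetchAndMaybeCache(cache, 2, { getBytes() { return new Uint8Array(0); } });
console.log(cache.has(1)); // true — plain object was cached
console.log(cache.has(2)); // false — the Stream was not cached
```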
@Snuffleupagus
Collaborator Author

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5662bfe67d2732d/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/777751a13e2419c/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/5662bfe67d2732d/output.txt

Total script time: 18.73 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/777751a13e2419c/output.txt

Total script time: 26.59 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/777751a13e2419c/reftest-analyzer.html#web=eq.log

@timvandermeij timvandermeij merged commit 62ec810 into mozilla:master Nov 30, 2019
@timvandermeij
Contributor

Thank you! I agree with your analysis, and looking at the original code I think it was an oversight.

@timvandermeij
Contributor

/botio-windows makeref

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/aa5d37e496074a0/output.txt

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/aa5d37e496074a0/output.txt

Total script time: 23.73 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@Snuffleupagus Snuffleupagus deleted the fetchCompressed-isStream branch November 30, 2019 16:55
@Snuffleupagus
Collaborator Author

> I agree with your analysis, and looking at the original code I think it was an oversight.

Yes, it absolutely looks like nothing more than an oversight, since the original code didn't support parsing of Streams within a compressed XRef entry. That was changed in PR #2341, in order to address issue #2337, which just so happens to be the very first PDF.js bug report I ever submitted :-)
