
Stop caching Streams in XRef.fetchCompressed #11370

Merged
merged 2 commits into from
Nov 30, 2019

Conversation

Snuffleupagus
Collaborator

I'm slightly surprised that this hasn't actually caused any (known) bugs, but that may be more luck than anything else, since it fortunately doesn't seem common for Streams to be defined inside of an 'ObjStm'.[1]

Note that in the `XRef.fetchUncompressed` method we're *not* caching Streams, and for very good reasons too:

 - Streams, especially the `DecodeStream` ones, can become *very* large once read. Hence caching them really isn't a good idea, simply because of the (potential) memory impact of doing so.

 - Attempting to read from the *same* Stream more than once won't work, unless it's `reset` in between, since any method such as `getBytes` always starts reading at the current data position.

 - Given that even the `src/core/` code is now fairly asynchronous (see e.g. the `PartialEvaluator`), it's generally impossible to assert that any one Stream isn't being accessed "concurrently" by e.g. different `getOperatorList` calls. Hence `reset`-ing a cached Stream isn't going to work in the general case.

All in all, I cannot understand why it'd ever be correct to cache Streams in the `XRef.fetchCompressed` method.

---
[1] One example where that happens is the `issue3115r.pdf` file in the test-suite, where the streams in question are not actually used for anything within the PDF.js code.
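The second point above — that a Stream can only be read once unless it is `reset` in between — can be illustrated with a minimal sketch. `SimpleStream` here is a hypothetical stand-in loosely modeled on the behaviour being described, not PDF.js's actual `Stream` class:

```javascript
// Hypothetical minimal stream; reading always starts at the current
// data position and advances it, just like the behaviour described above.
class SimpleStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.start = 0;
    this.pos = 0;
  }
  // Returns everything from the *current* position to the end,
  // leaving the position at the end of the data.
  getBytes() {
    const result = this.bytes.subarray(this.pos);
    this.pos = this.bytes.length;
    return result;
  }
  reset() {
    this.pos = this.start;
  }
}

const stream = new SimpleStream(new Uint8Array([1, 2, 3]));
console.log(stream.getBytes().length); // 3 — first read gets all the data
console.log(stream.getBytes().length); // 0 — second read starts at the end
stream.reset();
console.log(stream.getBytes().length); // 3 — works again only after reset()
```

This is why handing the *same* cached Stream instance to two independent consumers is unsafe: whichever reads first leaves the position at the end for the other.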

 - Change all occurrences of `var` to `let`/`const`.

 - Initialize the (temporary) Arrays with the correct sizes upfront.

 - Inline the `isCmd` check. Obviously this won't make a huge difference, but given that the check is only relevant for corrupt documents it cannot hurt.
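The guard being discussed — fetch the object, but decline to cache it when it is a Stream — can be sketched roughly as follows. The `isStream` predicate and `fetchAndMaybeCache` helper are simplified illustrations, not the actual PDF.js helpers or signatures:

```javascript
// Simplified stand-in for a "is this a Stream?" check; PDF.js has its
// own helper for this, which this sketch does not reproduce.
function isStream(obj) {
  return (
    obj !== null &&
    typeof obj === "object" &&
    typeof obj.getBytes === "function"
  );
}

// Cache plain (immutable) objects only; Streams are skipped because
// reading them mutates their position and they can be very large.
function fetchAndMaybeCache(cacheMap, num, obj) {
  if (!isStream(obj)) {
    cacheMap.set(num, obj);
  }
  return obj;
}

const cache = new Map();
fetchAndMaybeCache(cache, 1, { name: "Dict" });
fetchAndMaybeCache(cache, 2, { getBytes() { return new Uint8Array(0); } });
console.log(cache.has(1)); // true — plain object was cached
console.log(cache.has(2)); // false — the Stream was not cached
```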
@Snuffleupagus
Collaborator Author

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5662bfe67d2732d/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/777751a13e2419c/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/5662bfe67d2732d/output.txt

Total script time: 18.73 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/777751a13e2419c/output.txt

Total script time: 26.59 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/777751a13e2419c/reftest-analyzer.html#web=eq.log

@timvandermeij timvandermeij merged commit 62ec810 into mozilla:master Nov 30, 2019
@timvandermeij
Contributor

Thank you! I agree with your analysis, and looking at the original code I think it was an oversight.

@timvandermeij
Contributor

/botio-windows makeref

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/aa5d37e496074a0/output.txt

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/aa5d37e496074a0/output.txt

Total script time: 23.73 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@Snuffleupagus Snuffleupagus deleted the fetchCompressed-isStream branch November 30, 2019 16:55
@Snuffleupagus
Collaborator Author

> I agree with your analysis, and looking at the original code I think it was an oversight.

Yes, it absolutely looks like nothing more than an oversight, since the original code didn't support parsing of Streams within a compressed XRef entry. That was changed in PR #2341, in order to address issue #2337, which just so happens to be the very first PDF.js bug report I ever submitted :-)
