Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add local caching of Functions, by reference, in the PDFFunctionFactory (issue 2541) #12034

Merged
merged 1 commit into from
Jul 4, 2020

Conversation

Snuffleupagus
Copy link
Collaborator

@Snuffleupagus Snuffleupagus commented Jun 28, 2020

Note that compared other structures, such as e.g. Images and ColorSpaces, Functions are not referred to by name, which however does bring the advantage of being able to share the cache for an entire page.
Furthermore, similar to ColorSpaces, the parsing of individual Functions are generally fast enough to not really warrant trying to cache them in any "smarter" way than by reference. (Hence trying to do caching similar to e.g. Fonts would most likely be a losing proposition, given the amount of data lookup/parsing that'd be required.)

Originally I tried implementing this similar to e.g. the recently added ColorSpace caching (and in a couple of different ways), however it unfortunately turned out to be quite ugly/unwieldy given the sheer number of functions/methods where you'd thus need to pass in a LocalFunctionCache instance. (Also, the affected functions/methods didn't exactly have short signatures as-is.)
After going back and forth on this for a while it seemed to me that the simplest, or least "invasive" if you will, solution would be if each PartialEvaluator instance had its own PDFFunctionFactory instance (since the latter is already passed to all of the required code). This way each PDFFunctionFactory instances could have a local Function cache, without it being necessary to provide a LocalFunctionCache instance manually at every PDFFunctionFactory.{create, createFromArray} call-site.

Obviously, with this patch, there's now (potentially) more PDFFunctionFactory instances than before when the entire document shared just one. However, each such instance is really quite small and it's also tied to a PartialEvaluator instance and those are not kept alive and/or cached. To reduce the impact of these changes, I've tried to make as many of these structures as possible lazily initialized, specifically:

  • The PDFFunctionFactory, on PartialEvaluator instances, since not all kinds of general parsing actually requires it. For example: getTextContent calls won't cause any Function to be parsed, and even some getOperatorList calls won't trigger Function parsing (if a page contains e.g. no Patterns or "complex" ColorSpaces).

  • The LocalFunctionCache, on PDFFunctionFactory instances, since only certain parsing requires it. Generally speaking, only e.g. Patterns, "complex" ColorSpaces, and/or (some) SoftMasks will trigger any Function parsing.

To put these changes into perspective, when loading/rendering all (14) pages of the default tracemonkey.pdf file there's now a total of 6 PDFFunctionFactory and 1 LocalFunctionCache instances created thanks to the lazy initialization.
(If you instead would keep the document-"global" PDFFunctionFactory instance and pass around LocalFunctionCache instances everywhere, the numbers for the tracemonkey.pdf file would be instead be something like 1 PDFFunctionFactory and 6 LocalFunctionCache instances.)
All-in-all, I thus don't think that the PDFFunctionFactory changes should be generally problematic.

With these changes, we can also modify (some) call-sites to pass in a Reference rather than the actual Function data. This is nice since Functions can also be Streams, which are not cached on the XRef instance (given their potential size), and this way we can avoid unnecessary lookups and thus save some additional time/resources.

Obviously I had intended to include (standard) benchmark results with these changes, but for reasons I don't really understand the test run-time (even with master) of the document in issue 2541 is quite a bit slower than in the development viewer.
However, logging the time it takes for the relevant PDFFunctionFactory/PDFFunction parsing shows that it takes approximately 0.5 ms for the Function in question. Looking up a cached Function, on the other hand, is one order of magnitude faster which does add up when the same Function is invoked close to 2000 times.

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/3887112ee88ead3/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/694bc80c7d35ba6/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/3887112ee88ead3/output.txt

Total script time: 25.84 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/3887112ee88ead3/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/694bc80c7d35ba6/output.txt

Total script time: 29.01 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/694bc80c7d35ba6/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus force-pushed the Function-local-cache-3 branch 9 times, most recently from 347facc to 57a292a Compare June 30, 2020 19:30
@Snuffleupagus Snuffleupagus marked this pull request as ready for review June 30, 2020 19:32
@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f05c82d485dd480/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/f05c82d485dd480/output.txt

Total script time: 29.34 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/f05c82d485dd480/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus changed the title Add local caching of Functions, by reference, in the PDFFunctionFactory Add local caching of Functions, by reference, in the PDFFunctionFactory (issue 2541) Jul 1, 2020
…ctory` (issue 2541)

Note that compared other structures, such as e.g. Images and ColorSpaces, `Function`s are not referred to by name, which however does bring the advantage of being able to share the cache for an *entire* page.
Furthermore, similar to ColorSpaces, the parsing of individual `Function`s are generally fast enough to not really warrant trying to cache them in any "smarter" way than by reference. (Hence trying to do caching similar to e.g. Fonts would most likely be a losing proposition, given the amount of data lookup/parsing that'd be required.)

Originally I tried implementing this similar to e.g. the recently added ColorSpace caching (and in a couple of different ways), however it unfortunately turned out to be quite ugly/unwieldy given the sheer number of functions/methods where you'd thus need to pass in a `LocalFunctionCache` instance. (Also, the affected functions/methods didn't exactly have short signatures as-is.)
After going back and forth on this for a while it seemed to me that the simplest, or least "invasive" if you will, solution would be if each `PartialEvaluator` instance had its *own* `PDFFunctionFactory` instance (since the latter is already passed to all of the required code). This way each `PDFFunctionFactory` instances could have a local `Function` cache, without it being necessary to provide a `LocalFunctionCache` instance manually at every `PDFFunctionFactory.{create, createFromArray}` call-site.

Obviously, with this patch, there's now (potentially) more `PDFFunctionFactory` instances than before when the entire document shared just one. However, each such instance is really quite small and it's also tied to a `PartialEvaluator` instance and those are *not* kept alive and/or cached. To reduce the impact of these changes, I've tried to make as many of these structures as possible *lazily initialized*, specifically:

 - The `PDFFunctionFactory`, on `PartialEvaluator` instances, since not all kinds of general parsing actually requires it. For example: `getTextContent` calls won't cause any `Function` to be parsed, and even some `getOperatorList` calls won't trigger `Function` parsing (if a page contains e.g. no Patterns or "complex" ColorSpaces).

 - The `LocalFunctionCache`, on `PDFFunctionFactory` instances, since only certain parsing requires it. Generally speaking, only e.g. Patterns, "complex" ColorSpaces, and/or (some) SoftMasks will trigger any `Function` parsing.

To put these changes into perspective, when loading/rendering all (14) pages of the default `tracemonkey.pdf` file there's now a total of 6 `PDFFunctionFactory` and 1 `LocalFunctionCache` instances created thanks to the lazy initialization.
(If you instead would keep the document-"global" `PDFFunctionFactory` instance and pass around `LocalFunctionCache` instances everywhere, the numbers for the `tracemonkey.pdf` file would be instead be something like 1 `PDFFunctionFactory` and 6 `LocalFunctionCache` instances.)
All-in-all, I thus don't think that the `PDFFunctionFactory` changes should be generally problematic.

With these changes, we can also modify (some) call-sites to pass in a `Reference` rather than the actual `Function` data. This is nice since `Function`s can also be `Streams`, which are not cached on the `XRef` instance (given their potential size), and this way we can avoid unnecessary lookups and thus save some additional time/resources.

Obviously I had intended to include (standard) benchmark results with these changes, but for reasons I don't really understand the test run-time (even with `master`) of the document in issue 2541 is quite a bit slower than in the development viewer.
However, logging the time it takes for the relevant `PDFFunctionFactory`/`PDFFunction ` parsing shows that it takes *approximately* `0.5 ms` for the `Function` in question. Looking up a cached `Function`, on the other hand, is *one order of magnitude faster* which does add up when the same `Function` is invoked close to 2000 times.
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Jul 3, 2020

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/6d1f9254395cb43/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jul 3, 2020

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/97dd93abf9c78ef/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jul 3, 2020

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/97dd93abf9c78ef/output.txt

Total script time: 26.21 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/97dd93abf9c78ef/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Jul 3, 2020

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/6d1f9254395cb43/output.txt

Total script time: 29.07 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/6d1f9254395cb43/reftest-analyzer.html#web=eq.log

@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

pdfjsbot commented Jul 4, 2020

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/82b629c7a831a16/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jul 4, 2020

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/82b629c7a831a16/output.txt

Total script time: 3.33 mins

Published

@timvandermeij timvandermeij merged commit 29548ad into mozilla:master Jul 4, 2020
@timvandermeij
Copy link
Contributor

Thank you for improving this!

@Snuffleupagus Snuffleupagus deleted the Function-local-cache-3 branch July 4, 2020 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants