Fixes caching of inline images during parsing. #5445

CodingFabian · 2014-10-26T16:06:43Z

As described in #5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However, the
previous caching implementation cached "lazily": it only considered an
image cache-worthy once it had been repeated, and only the repeated
instance was then cached.
As a result, the sequence of identical images A1 A2 A3 A4 would be
seen as A1 A2 A2 A2 by the evaluator, which prevents using the "repeat"
optimization.

Also, the previous cache implementation only checked the last used
image.
Thus the sequence A1 B1 A2 B2 A3 B3 would yield 6 image instances, even
though there are only two different images.
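
To make the two sequences above concrete, here is a minimal sketch of the old behaviour. It is a simplified model built only from the description in this comment, not the actual parser code; `makeLazySingleEntryCache` and `labelsSeenLazy` are hypothetical names, and string keys stand in for image checksums.

```javascript
// Simplified model of the OLD behaviour (assumption: not the real pdf.js
// code). A single-slot cache that only stores an image once the same
// checksum has been seen twice in a row.
function makeLazySingleEntryCache() {
  let lastKey = null;     // checksum of the most recently parsed image
  let cachedKey = null;   // checksum of the one cached image, if any
  let cachedImage = null;

  return function parse(key, label) {
    if (cachedKey === key) {
      return cachedImage;            // cache hit: the same object is returned
    }
    const image = { label: label };  // a fresh object for every miss
    if (lastKey === key) {
      // Second consecutive sighting: only now is the image cached.
      cachedKey = key;
      cachedImage = image;
    }
    lastKey = key;
    return image;
  };
}

// Feeds a sequence of keys through the cache and reports which instance
// (A1, A2, ...) the evaluator would end up seeing for each position.
function labelsSeenLazy(keys) {
  const parse = makeLazySingleEntryCache();
  const counters = {};
  return keys.map(function (key) {
    counters[key] = (counters[key] || 0) + 1;
    return parse(key, key + counters[key]).label;
  }).join(' ');
}

console.log(labelsSeenLazy(['A', 'A', 'A', 'A']));
console.log(labelsSeenLazy(['A', 'B', 'A', 'B', 'A', 'B']));
```

In this model the first call reports A1 A2 A2 A2 and the second reports six distinct instances, because alternating images never match the last seen checksum.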

The new implementation drops the "lazy" init of the cache. The size
threshold for enabling an image to be cached is rather small, so the
potential waste in storage and adler32 calculation is pretty low.
Also, this implementation now keeps hold of all cacheable images, not
only the last seen image.

The two examples from above would now be A1 A1 A1 A1 and A1 B1 A1 B1 A1 B1,
which not only saves temporary storage, but also avoids computing
identical masks over and over again (which is the main performance impact
of #2618).
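
The new behaviour can be sketched the same way. This is a simplified model of an eager cache that keeps every entry, not the code in this PR; `makeEagerCache` and `labelsSeenEager` are hypothetical names, and string keys stand in for the adler32 checksum.

```javascript
// Simplified model of the NEW behaviour (assumption: not the real pdf.js
// code). Every cacheable image is stored on first sight, keyed by checksum.
function makeEagerCache() {
  const cache = new Map();  // checksum -> image object

  return function parse(key, label) {
    if (cache.has(key)) {
      return cache.get(key);         // identical object on every repeat
    }
    const image = { label: label };
    cache.set(key, image);
    return image;
  };
}

// Reports which instance the evaluator would see for each position.
function labelsSeenEager(keys) {
  const parse = makeEagerCache();
  const counters = {};
  return keys.map(function (key) {
    counters[key] = (counters[key] || 0) + 1;
    return parse(key, key + counters[key]).label;
  }).join(' ');
}

console.log(labelsSeenEager(['A', 'A', 'A', 'A']));
console.log(labelsSeenEager(['A', 'B', 'A', 'B', 'A', 'B']));
```

Because every repeat returns the identical object, an identity check in the evaluator can succeed from the second occurrence onwards.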

CodingFabian · 2014-10-26T16:10:34Z

For the mentioned PDF, this cuts rendering time nearly in half:

before:

Page: 1
Page Request 148ms
Rendering    3854ms
Overall      4002ms

after:

Page: 1
Page Request 152ms
Rendering    2046ms
Overall      2198ms

Snuffleupagus · 2014-10-26T16:14:45Z

/botio-windows test

pdfjsbot · 2014-10-26T16:14:46Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/a9fd94bd2e163ba/output.txt

pdfjsbot · 2014-10-26T16:17:41Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/a9fd94bd2e163ba/output.txt

Total script time: 2.92 mins

Font tests: FAILED
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/a9fd94bd2e163ba/reftest-analyzer.html#web=eq.log

CodingFabian · 2014-10-26T16:18:00Z

If the time spent calculating the adler32 checksum is an issue for PDFs where the cache is never hit, we could keep a cache of candidates by length and then calculate the adler32 lazily once a second image of the same length is encountered.
But while any PDF could potentially be slowed down by the change, I believe any PDF could benefit from it as well :-)
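
A rough sketch of that length-first idea, under stated assumptions: `makeLengthFirstCache`, `adler32Of` and the entry layout are hypothetical and not part of this PR. Checksums are only computed once a second image of the same length shows up, and a match is decided on the checksum alone, as in the approach discussed here.

```javascript
// Plain Adler-32 with the modulo applied on every step.
function adler32Of(bytes) {
  let a = 1;
  let b = 0;
  for (let i = 0; i < bytes.length; i++) {
    a = (a + (bytes[i] & 0xff)) % 65521;
    b = (b + a) % 65521;
  }
  return ((b << 16) | a) >>> 0;  // force an unsigned 32-bit result
}

// Hypothetical cache: candidates are bucketed by length, and a checksum is
// only computed when a bucket receives a second image of that length.
function makeLengthFirstCache() {
  const byLength = new Map();  // length -> array of { bytes, checksum, image }
  let checksumCalls = 0;

  function checksumOf(entry) {
    if (entry.checksum === undefined) {
      entry.checksum = adler32Of(entry.bytes);
      checksumCalls++;
    }
    return entry.checksum;
  }

  function parse(bytes) {
    const incoming = { bytes: bytes, checksum: undefined, image: { bytes: bytes } };
    const bucket = byLength.get(bytes.length);
    if (!bucket) {
      byLength.set(bytes.length, [incoming]);  // no checksum needed yet
      return incoming.image;
    }
    const key = checksumOf(incoming);
    for (const entry of bucket) {
      if (checksumOf(entry) === key) {
        return entry.image;                    // treated as the same image
      }
    }
    bucket.push(incoming);
    return incoming.image;
  }

  return { parse: parse, checksumCount: function () { return checksumCalls; } };
}
```

The trade-off is that each candidate's bytes must be kept around until its checksum is needed, so the saving in hashing time costs some extra memory.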

Snuffleupagus · 2014-10-26T22:29:21Z

Did this patch actually pass the regression tests locally?
I noticed that at least tutorial.pdf fails with this patch applied (on Windows and Firefox Nightly 36).

CodingFabian · 2014-10-26T22:50:03Z

Jonas, I do not think this PR is work in progress.
For example, tutorial.pdf renders just fine. However, I cannot check a potential Windows regression. The Windows bot you started needs to be kicked; maybe @yurydelendik can kick it. Can you post a screenshot of the diff in the meantime?

CodingFabian · 2014-10-26T23:06:24Z

OK, I see it, on page 14 for example. Strange. My best guess is that the checksumming isn't as good as it should be, now that I changed it to allow more than one entry...
Let's see what happens if I change it back to only one entry.

CodingFabian · 2014-10-26T23:26:41Z

thanks for the heads up @Snuffleupagus - it really is that the checksumming considers two different images as equal. ugh.

CodingFabian · 2014-10-26T23:38:58Z

According to what I just read about adler32, I think that checksum is not really suitable: because my patch causes more images to be potentially reused, the checksum now generates more false positives.
I will update the PR with a more reliable checksum tomorrow.
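
As a small illustration of what a false positive looks like, the sketch below shows two different three-byte inputs that share an Adler-32 value. The byte sequences are made up for the example and do not come from tutorial.pdf; `adler32Checksum` is a hypothetical helper name.

```javascript
// Plain Adler-32 with the modulo applied on every step.
function adler32Checksum(bytes) {
  let a = 1;
  let b = 0;
  for (let i = 0; i < bytes.length; i++) {
    a = (a + (bytes[i] & 0xff)) % 65521;
    b = (b + a) % 65521;
  }
  return ((b << 16) | a) >>> 0;
}

// Both inputs have the same byte sum and the same position-weighted sum,
// so the two halves of the checksum (a and b) come out identical.
const first = [0, 2, 0];
const second = [1, 0, 1];
console.log(adler32Checksum(first) === adler32Checksum(second));  // true
```

A cache keyed only by this value would hand back the first image when asked for the second, unless it also compares the bytes themselves.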

CodingFabian · 2014-10-26T23:54:05Z

Could we run a Linux bot on the current state? Besides the adler32 weakness, I assume that the only regression difference might be due to the binary mask.

yurydelendik · 2014-10-27T03:31:31Z

/botio-windows test

pdfjsbot · 2014-10-27T03:31:32Z

From: Bot.io (Windows)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/d737f85d40d5df1/output.txt

pdfjsbot · 2014-10-27T03:51:50Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/d737f85d40d5df1/output.txt

Total script time: 20.30 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/d737f85d40d5df1/reftest-analyzer.html#web=eq.log

CodingFabian · 2014-10-27T07:50:26Z

looks good, doesn't it?

yurydelendik · 2014-10-27T14:45:41Z

src/core/parser.js

@@ -221,7 +216,7 @@ var Parser = (function ParserClosure() {

      imageStream = this.filter(imageStream, dict, length);
      imageStream.dict = dict;
-      if (cacheImage) {
+      if (adler32) {


There is a 1:2^32 chance that adler32 is 0, and a value of 0 would fail this check.
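
The pitfall comes from 0 being falsy in JavaScript. A minimal sketch of the difference between the two guards, assuming (this is not confirmed by the diff) that a non-cacheable image is represented by `undefined`; both helper names are hypothetical:

```javascript
// Guard as written in the diff: a checksum of 0 is treated like "no checksum".
function isCacheableByTruthiness(adler32) {
  return Boolean(adler32);
}

// Alternative guard (assumption: "no checksum" is represented as undefined):
// a checksum of 0 is still accepted as a valid cache key.
function isCacheableByPresence(adler32) {
  return adler32 !== undefined;
}

console.log(isCacheableByTruthiness(0));  // false
console.log(isCacheableByPresence(0));    // true
```

With the truthiness guard, an image whose checksum happens to be 0 would simply never be cached, which costs performance but not correctness.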

yurydelendik · 2014-10-27T14:47:03Z

Is adler32 slower or faster than https://github.com/mozilla/pdf.js/blob/master/src/core/murmurhash3.js ?

yurydelendik · 2014-10-27T14:48:20Z

> looks good, doesn't it?

The failures are somewhat expected, but we have to be careful.

CodingFabian · 2014-10-27T14:55:23Z

I'll measure MurmurHash3_64 and see how it goes.

It takes more than twice as long (398ms vs 149ms for the PDF in question).

CodingFabian · 2014-10-27T15:41:43Z

I figured it out. Writing to the cache was not the problem; the reading part never assumed more than one entry in it.
I fixed that now and I will stay with adler32 until proven otherwise.

yurydelendik · 2014-10-27T15:57:41Z

/botio-windows test

pdfjsbot · 2014-10-27T15:57:42Z

From: Bot.io (Windows)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/9adcf0ad2e2ed3c/output.txt

pdfjsbot · 2014-10-27T16:15:48Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/9adcf0ad2e2ed3c/output.txt

Total script time: 18.09 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/9adcf0ad2e2ed3c/reftest-analyzer.html#web=eq.log

fkaelberer · 2014-10-28T14:09:54Z

If adler32 speed makes a difference for overall rendering time, then we could also skip the modulo operation inside the loop, provided MAX_LENGTH_TO_CACHE is not greater than 5552, according to the German Wikipedia article:

var a = 1;
var b = 0;
for (var i = 0, ii = imageBytes.length; i < ii; ++i) {
  // No modulo is required inside the loop if imageBytes.length < 5552.
  a += imageBytes[i] & 0xff;
  b += a;
}
// Apply the modulo once, after the loop.
var adler32 = ((b % 65521) << 16) | (a % 65521);

(Quick sanity check: a <= 255 * MAX_LENGTH ==> b <= MAX_LENGTH * MAX_LENGTH * 255 ==> no overflow if MAX_LENGTH <= 1000, so it does not matter if we compute % 65521 every round or only once in the end.)
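
The sanity check can also be run directly. The sketch below compares a per-step-modulo Adler-32 with the deferred-modulo variant and computes the worst-case sums for 1000 bytes of 0xff; the function names are hypothetical. Note that JavaScript numbers are doubles, so the sums stay exact well beyond this length; the bound matters for keeping them within 32-bit integer range.

```javascript
// Reference: modulo applied on every step.
function adler32Reference(bytes) {
  let a = 1;
  let b = 0;
  for (let i = 0; i < bytes.length; i++) {
    a = (a + (bytes[i] & 0xff)) % 65521;
    b = (b + a) % 65521;
  }
  return ((b << 16) | a) >>> 0;
}

// Deferred: modulo applied once, after the loop.
function adler32Deferred(bytes) {
  let a = 1;
  let b = 0;
  for (let i = 0; i < bytes.length; i++) {
    a += bytes[i] & 0xff;
    b += a;
  }
  return (((b % 65521) << 16) | (a % 65521)) >>> 0;
}

// Largest values a and b can reach for n bytes (every byte is 0xff).
function worstCaseSums(n) {
  let a = 1;
  let b = 0;
  for (let i = 0; i < n; i++) {
    a += 255;
    b += a;
  }
  return { a: a, b: b };
}

const worst = worstCaseSums(1000);
console.log(worst.b < Math.pow(2, 31));  // true: b stays below 2^31
```

For 1000 bytes the worst-case b is 127628500, comfortably below 2^31, so both variants produce the same checksum.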

CodingFabian · 2014-10-28T14:21:41Z

Well, it has an impact. I will measure the difference and update the PR, @fkaelberer.

It's about twice as fast. Great suggestion. Thanks.

As described in mozilla#5444, the evaluator will perform identity checking of paintImageMaskXObjects to decide if it can use paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup. This can only ever work if the entry is a cache hit. However, the previous caching implementation cached lazily: it only considered an image cache-worthy once it had been repeated, and only then was the repeated instance cached. As a result, the sequence of identical images A1 A2 A3 A4 would be seen as A1 A2 A2 A2 by the evaluator, which prevents using the "repeat" optimization. Also, only the last encountered image was cached, so A1 B1 A2 B2 would stay A1 B1 A2 B2. The new implementation drops the "lazy" init of the cache. The threshold for enabling an image to be cached is rather small, so the potential waste in storage and adler32 calculation is rather low. It also caches any eligible image by its adler32. The two examples from above would now be A1 A1 A1 A1 and A1 B1 A1 B1, which not only saves temporary storage, but also avoids computing identical masks over and over again (which is the main performance impact of mozilla#2618).

yurydelendik · 2014-12-15T16:03:40Z

/botio test

pdfjsbot · 2014-12-15T16:03:41Z

From: Bot.io (Linux)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/e6c84d6e7269dc1/output.txt

pdfjsbot · 2014-12-15T16:03:41Z

From: Bot.io (Windows)

Received

Command cmd_test from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/96be496c0b2908b/output.txt

pdfjsbot · 2014-12-15T16:20:50Z

From: Bot.io (Windows)

Failed

Full output at http://107.22.172.223:8877/96be496c0b2908b/output.txt

Total script time: 17.13 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.22.172.223:8877/96be496c0b2908b/reftest-analyzer.html#web=eq.log

pdfjsbot · 2014-12-15T16:26:12Z

From: Bot.io (Linux)

Failed

Full output at http://107.21.233.14:8877/e6c84d6e7269dc1/output.txt

Total script time: 22.51 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://107.21.233.14:8877/e6c84d6e7269dc1/reftest-analyzer.html#web=eq.log

yurydelendik · 2014-12-15T16:51:21Z

/botio makeref

Thank you for the patch

pdfjsbot · 2014-12-15T16:51:22Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @yurydelendik received. Current queue size: 0

Live output at: http://107.22.172.223:8877/7d2734ade5d8cf4/output.txt

pdfjsbot · 2014-12-15T16:51:22Z

From: Bot.io (Linux)

Received

Command cmd_makeref from @yurydelendik received. Current queue size: 0

Live output at: http://107.21.233.14:8877/4ce79fafcef6541/output.txt

pdfjsbot · 2014-12-15T17:08:23Z

From: Bot.io (Windows)

Success

Full output at http://107.22.172.223:8877/7d2734ade5d8cf4/output.txt

Total script time: 17.02 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2014-12-15T17:13:30Z

From: Bot.io (Linux)

Success

Full output at http://107.21.233.14:8877/4ce79fafcef6541/output.txt

Total script time: 22.14 mins

Lint: Passed
Make references: Passed
Check references: Passed

Snuffleupagus added the core label Oct 26, 2014

Snuffleupagus added the 4-work-in-progress label Oct 26, 2014

CodingFabian force-pushed the fixImageCachingInParser branch from 7e47b48 to 163fa7b Compare October 26, 2014 23:20

CodingFabian force-pushed the fixImageCachingInParser branch from 163fa7b to 7969fd5 Compare October 27, 2014 08:16

Snuffleupagus removed the 4-work-in-progress label Oct 27, 2014

yurydelendik reviewed Oct 27, 2014
CodingFabian force-pushed the fixImageCachingInParser branch from 7969fd5 to 946cebe Compare October 27, 2014 15:51

CodingFabian force-pushed the fixImageCachingInParser branch from 946cebe to 970c048 Compare October 28, 2014 14:39

timvandermeij assigned yurydelendik Dec 8, 2014

yurydelendik added a commit that referenced this pull request Dec 15, 2014

Merge pull request #5445 from CodingFabian/fixImageCachingInParser

f5df30f

Fixes caching of inline images during parsing.

yurydelendik merged commit f5df30f into mozilla:master Dec 15, 2014

CodingFabian mentioned this pull request Dec 15, 2014

PDF.js slow at rendering complex image #2618

Closed

CodingFabian deleted the fixImageCachingInParser branch December 15, 2014 19:15

Snuffleupagus mentioned this pull request Aug 18, 2017

Fix caching of small inline images in Parser.makeInlineImage (issue 8790) #8792

Merged
