[Optimization] Read example pixels only when necessary #69
Conversation
);

//get example pattern to compare to
k_neighs_to_color_pattern(
I'm not sure why this call to generate `my_guide_pattern` was in the candidates loop. It seemed not to matter, so I moved it out.
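A minimal sketch of the hoisting described here, assuming (as the comment implies) that the guide pattern does not depend on the candidate; `build_guide_pattern` and `score` are hypothetical stand-ins, not the crate's actual signatures:

```rust
// Hypothetical sketch of the loop hoisting described above. `build_guide_pattern`
// stands in for k_neighs_to_color_pattern, and `score` for whatever comparison
// happens per candidate. The guide pattern is identical for every candidate,
// so it is computed once instead of once per loop iteration.
fn score_candidates(
    candidates: &[usize],
    build_guide_pattern: impl Fn() -> Vec<f32>,
    score: impl Fn(usize, &[f32]) -> f32,
) -> Vec<f32> {
    // Hoisted out of the loop: does not vary per candidate.
    let my_guide_pattern = build_guide_pattern();
    candidates
        .iter()
        .map(|&c| score(c, &my_guide_pattern))
        .collect()
}
```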
}
}

fn k_neighs_to_precomputed_reference_pattern(
Renamed this function to differentiate it from the candidate pixels, which are now read on the fly; this function is now only called to build `my_pattern` and `my_guide_pattern`.
}
}
score += next_pixel_score * distance_gaussians[i];
if score >= current_best {
    return None;
This is the heart of the speed-up: because we return early here, we never read the remaining pixels for this candidate.
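A minimal sketch of the lazy-read-plus-early-exit pattern this comment refers to, with hypothetical names and a `read_pixel` closure standing in for the crate's actual example-image lookups:

```rust
/// Hypothetical sketch of lazy scoring with early exit. `read_pixel` stands in
/// for whatever lookup the real code performs against the example image.
fn candidate_score(
    neighbor_coords: &[(u32, u32)],
    my_pattern: &[f32],
    distance_gaussians: &[f32],
    current_best: f32,
    read_pixel: impl Fn(u32, u32) -> f32,
) -> Option<f32> {
    let mut score = 0.0;
    for (i, &(x, y)) in neighbor_coords.iter().enumerate() {
        // The example pixel is only read once its contribution is needed.
        let example_value = read_pixel(x, y);
        let diff = example_value - my_pattern[i];
        let next_pixel_score = diff * diff;
        score += next_pixel_score * distance_gaussians[i];
        // Early exit: once this candidate can no longer beat the best one,
        // the remaining neighbor pixels are never read at all.
        if score >= current_best {
            return None;
        }
    }
    Some(score)
}
```

Because `current_best` shrinks as better candidates are found, later candidates tend to bail out after only a few neighbors, which is where the pixel reads are saved.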
}
}
score += next_pixel_score * distance_gaussians[i];
Should probably shrink `distance_gaussians` from size (num neighbors) * 4 to just num neighbors, as that is all I'm using here.
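A rough sketch of what that shrink might look like, assuming `distance_gaussians` is a simple per-neighbor Gaussian falloff; the actual weight computation in the crate may differ:

```rust
// Hypothetical sketch: one Gaussian weight per neighbor rather than one per
// neighbor channel. `neighbor_dists_sq` (squared distances to the center) and
// `sigma` are assumed inputs; the real crate may build these weights differently.
fn per_neighbor_gaussians(neighbor_dists_sq: &[f32], sigma: f32) -> Vec<f32> {
    neighbor_dists_sq
        .iter()
        .map(|&d2| (-d2 / (2.0 * sigma * sigma)).exp())
        .collect()
}
```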
Unfortunately I didn't have time to review this today, but I will check it out Monday. Thanks for the PR!
This reverts commit 9cb6f58.
Thanks for the PR, here are the numbers from the baselines I've been running:
group pr-33 pr-69
----- ----- -----
guided/100 1.74 389.0±9.75ms ? B/sec 1.00 224.2±9.99ms ? B/sec
guided/200 1.58 730.3±4.26ms ? B/sec 1.00 463.7±6.23ms ? B/sec
guided/25 1.86 157.0±5.80ms ? B/sec 1.00 84.4±4.05ms ? B/sec
guided/400 1.41 2.9±0.01s ? B/sec 1.00 2.0±0.08s ? B/sec
guided/50 2.03 321.6±6.41ms ? B/sec 1.00 158.4±7.57ms ? B/sec
inpaint/100 1.38 183.5±7.84ms ? B/sec 1.00 133.2±7.17ms ? B/sec
inpaint/200 1.18 222.9±7.21ms ? B/sec 1.00 188.6±5.52ms ? B/sec
inpaint/25 1.44 44.0±1.20ms ? B/sec 1.00 30.6±1.07ms ? B/sec
inpaint/400 1.00 535.5±7.99ms ? B/sec 1.12 598.6±9.36ms ? B/sec
inpaint/50 1.48 181.2±7.72ms ? B/sec 1.00 122.7±7.48ms ? B/sec
inpaint_channel/100 1.00 200.2±8.54ms ? B/sec
inpaint_channel/200 1.00 597.6±51.78ms ? B/sec
inpaint_channel/25 1.00 17.7±0.11ms ? B/sec
inpaint_channel/400 1.00 473.3±8.34ms ? B/sec
inpaint_channel/50 1.00 68.7±7.33ms ? B/sec
multi_example/100 1.24 225.3±3.79ms ? B/sec 1.00 182.4±6.30ms ? B/sec
multi_example/200 1.00 445.4±9.75ms ? B/sec 1.04 465.3±8.19ms ? B/sec
multi_example/25 1.54 98.0±4.69ms ? B/sec 1.00 63.5±11.85ms ? B/sec
multi_example/400 1.00 1834.9±37.90ms ? B/sec 1.46 2.7±0.10s ? B/sec
multi_example/50 1.53 182.5±3.27ms ? B/sec 1.00 119.4±4.23ms ? B/sec
single_example/100 1.18 208.5±4.18ms ? B/sec 1.00 176.3±13.28ms ? B/sec
single_example/200 1.00 425.8±4.54ms ? B/sec 1.23 523.7±10.91ms ? B/sec
single_example/25 1.87 84.1±4.66ms ? B/sec 1.00 45.1±2.80ms ? B/sec
single_example/400 1.00 1709.8±12.95ms ? B/sec 1.55 2.6±0.04s ? B/sec
single_example/50 1.44 176.7±4.70ms ? B/sec 1.00 122.3±12.19ms ? B/sec
style_transfer/100 1.64 383.9±5.37ms ? B/sec 1.00 234.7±8.10ms ? B/sec
style_transfer/200 1.51 740.4±3.05ms ? B/sec 1.00 489.6±5.67ms ? B/sec
style_transfer/25 1.80 148.2±5.87ms ? B/sec 1.00 82.6±5.66ms ? B/sec
style_transfer/400 1.51 2.9±0.01s ? B/sec 1.00 1924.1±42.17ms ? B/sec
style_transfer/50 1.67 300.1±9.62ms ? B/sec 1.00 180.2±11.65ms ? B/sec
tiling/100 1.32 233.5±6.02ms ? B/sec 1.00 177.2±5.26ms ? B/sec
tiling/200 1.00 396.7±4.94ms ? B/sec 1.11 439.9±9.27ms ? B/sec
tiling/25 1.66 77.6±3.56ms ? B/sec 1.00 46.7±2.91ms ? B/sec
tiling/400 1.00 1518.5±39.65ms ? B/sec 1.39 2.1±0.04s ? B/sec
tiling/50 1.42 183.6±6.05ms ? B/sec 1.00 129.3±15.20ms ? B/sec
So, pretty much across-the-board improvements, and the regressions are fairly minor. And all tests pass, so I'm happy!
Hey! First, I just wanted to say thanks so much for taking the time to review! Second, I wanted to make sure that I'm not actually making your code worse :)

One thing I notice when I look at the numbers in your benchmark is that performance seems to have regressed on the larger images, which worries me a little. I'm curious what OS / CPU you are using? I tried another MacBook with an i7 and fewer cores (just 4 physical ones) and a DigitalOcean server (Ubuntu) with an Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, and ended up seeing no performance regressions for larger images, even well above 400x400 (still looks like a 25% speed up on average). Sadly I still have not found a non-Intel processor to benchmark against.

I'm sure you're pretty busy, but if you have the time I'd be really curious what results the following simple benchmarks give when run a couple of times with and without this PR (I compared against the commit right before this one):
I'm on an Intel Xeon 3.3 GHz on Linux. Ping @h3r2tic, who has a Threadripper he can try this on.
Ooooh, cool :) I'll test it when I'm back home!
I ran it on my TR 2990WX:
Interesting! Will have to look into why there's the slight regression on the larger ones on my Xeon then.
There's a slight regression here as well in the
Thanks for taking the time to benchmark this! If I'm reading those numbers correctly, it looks like your results for the 2048 case are:
So it looks like with this PR it ran in ~1:01 vs ~1:05, which doesn't seem to be a regression (though maybe I am misunderstanding). However, those results are really minor (in fact I'd almost be worried they were noise) compared to what I've gotten on my test machines. For example, here is my original test MacBook:
If you have the time I'd be curious to know (but I also don't want to take up too much of your time):
Ah, once again I find out that I can't even read xD You're obviously correct, @Mr4k, there was no regression there :) It's a stock 2990WX (https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-2990wx), so just 3 GHz, but it does have 32 cores.
Did a few tests with 8 cores (16 threads) and a 2.8 GHz Intel Xeon processor on a Google Cloud server and noticed that the improvements were 16-18% instead of 20-30%. Should be able to test with even more cores soon. It seems conceivable that the benefits drop off with a large number of threads (looks like you have 64). Still does not explain the regression, though.
Checklist
Description of Changes
I'm not sure if this is the kind of contribution you are looking for, but I did a little profiling (using Instruments, screenshot below) and found that (on my computer at least) a large amount of time was being taken up by the function `k_neighs_to_color_pattern` when creating the candidate patterns. It appears that looking up the pixels in the example images is somewhat costly, and since it is done in an inner loop, it ends up contributing significantly to runtime.

To try to cut down on this cost, I moved the pixel lookups for the candidate's neighbors into the `better_match` function, and I only read each pixel right before it needs to be used in the cost function. Because you are already stopping a lot of the cost computations early (when the current candidate cost exceeds the smallest candidate cost so far), fewer pixel lookups are performed. This does not change the algorithm at all.

This results in around a 14% - 45% speed up (according to your benchmark test suite on my computer), depending on the example image(s) used and the size of the output texture. The average speed up seems to be more in the range of 14 - 25% (very unscientifically computed). I assume there are pathological cases where no performance gain could occur, but I think they would be rare.
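To illustrate the shape of the change at the candidate-selection level, here is a hypothetical sketch (names and signatures are placeholders, not the crate's real `better_match` API): the best score found so far is passed into each scoring call, so losing candidates abandon their pixel lookups early.

```rust
// Hypothetical sketch of candidate selection with the best-so-far score fed
// back into scoring. `score_candidate` stands in for the lazy scoring routine
// and returns `None` once a candidate's running score exceeds `current_best`.
fn pick_best_candidate<F>(num_candidates: usize, score_candidate: F) -> Option<usize>
where
    F: Fn(usize, f32) -> Option<f32>,
{
    let mut current_best = f32::INFINITY;
    let mut best_index = None;
    for i in 0..num_candidates {
        // Pass the best score so far; a losing candidate returns early and
        // skips most of its example-pixel reads.
        if let Some(score) = score_candidate(i, current_best) {
            current_best = score;
            best_index = Some(i);
        }
    }
    best_index
}
```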
About my computer:
MacBook Pro (Retina, 13-inch, Early 2015)
Processor: 2.9 GHz Intel Core i5 (4 logical cores)
Memory: 8 GB 1867 MHz DDR3
Edit: Additionally tested with a MacBook Pro 2018
2.6 GHz Intel Core i7 (12 logical cores)
32 GB 2400 MHz DDR4
I also tried to keep the changes minimal, but some refactoring was necessary.
Disclaimer:
I have not tested this on a wide variety of devices or high-end CPUs.
Related Issues
I don't think this is related to any open issues.