[Optimization] Read example pixels only when necessary #69
Conversation
);

//get example pattern to compare to
k_neighs_to_color_pattern(
I'm not sure why this call to generate `my_guide_pattern` was in the candidates loop. It seemed not to matter, so I moved it out.
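A minimal sketch of the hoisting described here, assuming (as the comment implies) that the guide pattern does not depend on the candidate; `build_guide_pattern` and `score` are hypothetical stand-ins, not the crate's actual signatures:

```rust
// Hypothetical sketch of the loop hoisting described above. `build_guide_pattern`
// stands in for k_neighs_to_color_pattern, and `score` for whatever comparison
// happens per candidate. The guide pattern is identical for every candidate,
// so it is computed once instead of once per loop iteration.
fn score_candidates(
    candidates: &[usize],
    build_guide_pattern: impl Fn() -> Vec<f32>,
    score: impl Fn(usize, &[f32]) -> f32,
) -> Vec<f32> {
    // Hoisted out of the loop: does not vary per candidate.
    let my_guide_pattern = build_guide_pattern();
    candidates
        .iter()
        .map(|&c| score(c, &my_guide_pattern))
        .collect()
}
```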
}
}

fn k_neighs_to_precomputed_reference_pattern(
Renamed this function to differentiate it from the candidate pixels, which are now read on the fly; this function is now only called to build `my_pattern` and `my_guide_pattern`.
}
}
score += next_pixel_score * distance_gaussians[i];
if score >= current_best {
    return None;
This is the heart of the speed-up: because we return early here, we never read the remaining pixels for this candidate.
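A minimal sketch of the lazy-read-plus-early-exit pattern this comment refers to, with hypothetical names and a `read_pixel` closure standing in for the crate's actual example-image lookups:

```rust
/// Hypothetical sketch of lazy scoring with early exit. `read_pixel` stands in
/// for whatever lookup the real code performs against the example image.
fn candidate_score(
    neighbor_coords: &[(u32, u32)],
    my_pattern: &[f32],
    distance_gaussians: &[f32],
    current_best: f32,
    read_pixel: impl Fn(u32, u32) -> f32,
) -> Option<f32> {
    let mut score = 0.0;
    for (i, &(x, y)) in neighbor_coords.iter().enumerate() {
        // The example pixel is only read once its contribution is needed.
        let example_value = read_pixel(x, y);
        let diff = example_value - my_pattern[i];
        let next_pixel_score = diff * diff;
        score += next_pixel_score * distance_gaussians[i];
        // Early exit: once this candidate can no longer beat the best one,
        // the remaining neighbor pixels are never read at all.
        if score >= current_best {
            return None;
        }
    }
    Some(score)
}
```

Because `current_best` shrinks as better candidates are found, later candidates tend to bail out after only a few neighbors, which is where the pixel reads are saved.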
}
}
score += next_pixel_score * distance_gaussians[i];
Should probably shrink `distance_gaussians` from size (num neighbors) * 4 to just num neighbors, as that is all I'm using here.
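A rough sketch of what that shrink might look like, assuming `distance_gaussians` is a simple per-neighbor Gaussian falloff; the actual weight computation in the crate may differ:

```rust
// Hypothetical sketch: one Gaussian weight per neighbor rather than one per
// neighbor channel. `neighbor_dists_sq` (squared distances to the center) and
// `sigma` are assumed inputs; the real crate may build these weights differently.
fn per_neighbor_gaussians(neighbor_dists_sq: &[f32], sigma: f32) -> Vec<f32> {
    neighbor_dists_sq
        .iter()
        .map(|&d2| (-d2 / (2.0 * sigma * sigma)).exp())
        .collect()
}
```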
Unfortunately I didn't have time to review this today, but I will check it out Monday. Thanks for the PR!
This reverts commit 9cb6f58.
Thanks for the PR, here are the numbers from the baselines I've been running:
group pr-33 pr-69
----- ----- -----
guided/100 1.74 389.0±9.75ms ? B/sec 1.00 224.2±9.99ms ? B/sec
guided/200 1.58 730.3±4.26ms ? B/sec 1.00 463.7±6.23ms ? B/sec
guided/25 1.86 157.0±5.80ms ? B/sec 1.00 84.4±4.05ms ? B/sec
guided/400 1.41 2.9±0.01s ? B/sec 1.00 2.0±0.08s ? B/sec
guided/50 2.03 321.6±6.41ms ? B/sec 1.00 158.4±7.57ms ? B/sec
inpaint/100 1.38 183.5±7.84ms ? B/sec 1.00 133.2±7.17ms ? B/sec
inpaint/200 1.18 222.9±7.21ms ? B/sec 1.00 188.6±5.52ms ? B/sec
inpaint/25 1.44 44.0±1.20ms ? B/sec 1.00 30.6±1.07ms ? B/sec
inpaint/400 1.00 535.5±7.99ms ? B/sec 1.12 598.6±9.36ms ? B/sec
inpaint/50 1.48 181.2±7.72ms ? B/sec 1.00 122.7±7.48ms ? B/sec
inpaint_channel/100 1.00 200.2±8.54ms ? B/sec
inpaint_channel/200 1.00 597.6±51.78ms ? B/sec
inpaint_channel/25 1.00 17.7±0.11ms ? B/sec
inpaint_channel/400 1.00 473.3±8.34ms ? B/sec
inpaint_channel/50 1.00 68.7±7.33ms ? B/sec
multi_example/100 1.24 225.3±3.79ms ? B/sec 1.00 182.4±6.30ms ? B/sec
multi_example/200 1.00 445.4±9.75ms ? B/sec 1.04 465.3±8.19ms ? B/sec
multi_example/25 1.54 98.0±4.69ms ? B/sec 1.00 63.5±11.85ms ? B/sec
multi_example/400 1.00 1834.9±37.90ms ? B/sec 1.46 2.7±0.10s ? B/sec
multi_example/50 1.53 182.5±3.27ms ? B/sec 1.00 119.4±4.23ms ? B/sec
single_example/100 1.18 208.5±4.18ms ? B/sec 1.00 176.3±13.28ms ? B/sec
single_example/200 1.00 425.8±4.54ms ? B/sec 1.23 523.7±10.91ms ? B/sec
single_example/25 1.87 84.1±4.66ms ? B/sec 1.00 45.1±2.80ms ? B/sec
single_example/400 1.00 1709.8±12.95ms ? B/sec 1.55 2.6±0.04s ? B/sec
single_example/50 1.44 176.7±4.70ms ? B/sec 1.00 122.3±12.19ms ? B/sec
style_transfer/100 1.64 383.9±5.37ms ? B/sec 1.00 234.7±8.10ms ? B/sec
style_transfer/200 1.51 740.4±3.05ms ? B/sec 1.00 489.6±5.67ms ? B/sec
style_transfer/25 1.80 148.2±5.87ms ? B/sec 1.00 82.6±5.66ms ? B/sec
style_transfer/400 1.51 2.9±0.01s ? B/sec 1.00 1924.1±42.17ms ? B/sec
style_transfer/50 1.67 300.1±9.62ms ? B/sec 1.00 180.2±11.65ms ? B/sec
tiling/100 1.32 233.5±6.02ms ? B/sec 1.00 177.2±5.26ms ? B/sec
tiling/200 1.00 396.7±4.94ms ? B/sec 1.11 439.9±9.27ms ? B/sec
tiling/25 1.66 77.6±3.56ms ? B/sec 1.00 46.7±2.91ms ? B/sec
tiling/400 1.00 1518.5±39.65ms ? B/sec 1.39 2.1±0.04s ? B/sec
tiling/50 1.42 183.6±6.05ms ? B/sec 1.00 129.3±15.20ms ? B/sec
So, pretty much across-the-board improvements, and the regressions are fairly minor. And all tests pass, so I'm happy!
Hey! First, I just wanted to say thanks so much for taking the time to review! Second, I wanted to make sure that I'm not actually making your code worse :)

One thing I notice when I look at the numbers in your benchmark is that performance seems to have regressed on the larger images, which worries me a little. I'm curious what OS / CPU you are using? I tried another MacBook with an i7 and fewer cores (just 4 physical ones) and a DigitalOcean server (Ubuntu) with an Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, and ended up seeing no performance regressions for larger images, even well above 400x400 (still looks like a 25% speed up on average). Sadly I still have not found a non-Intel processor to benchmark against.

I'm sure you're pretty busy, but if you have the time I'd be really curious what results the following simple benchmarks give when run a couple of times with and without this PR (I compared against the commit right before this one):
I'm on an Intel Xeon 3.3 GHz on Linux. Ping @h3r2tic, who has a Threadripper he can try this on.
Ooooh, cool :) I'll test it when I'm back home!
I ran it on my TR 2990WX:
Interesting! Will have to look into why there's the slight regression on the larger ones on my Xeon then.
There's a slight regression here as well in the
Thanks for taking the time to benchmark this! If I'm reading those numbers correctly, it looks like your results for the 2048 case are:
So it looks like with this PR it ran in ~1:01 vs ~1:05, which doesn't seem to be a regression (though maybe I am misunderstanding). However, those results are really minor (in fact I'd almost be worried they were noise) compared to what I've gotten on my test machines. For example, here is my original test MacBook:
If you have the time I'd be curious to know (but I also don't want to take up too much of your time):
Ah, once again I find out that I can't even read xD You're obviously correct, @Mr4k, there was no regression there :) It's a stock 2990WX (https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-2990wx), so just 3 GHz, but it does have 32 cores.
Did a few tests with 8 cores (16 threads) and a 2.8 GHz Intel Xeon processor on a Google Cloud server and noticed that the improvements were 16-18% instead of 20-30%. Should be able to test with even more cores soon. It seems conceivable that the benefits drop off with a large number of threads (looks like you have 64). Still does not explain the regression, though.
Checklist
Description of Changes
I'm not sure if this is the kind of contribution you are looking for, but I did a little profiling (using Instruments, screenshot below) and found that (on my computer at least) a large amount of time was being taken up by the function `k_neighs_to_color_pattern` when creating the candidate patterns. It appears that looking up the pixels in the example images is somewhat costly, and since it is done in an inner loop, it ends up contributing significantly to runtime.

To try to cut down on this cost, I moved the pixel lookups for the candidate's neighbors into the `better_match` function, and I only read each pixel right before it needs to be used in the cost function. Because you are already stopping a lot of the cost computations early (when the current candidate cost exceeds the smallest candidate cost so far), fewer pixel lookups are performed. This does not change the algorithm at all.

This results in around a 14% - 45% speed up (according to your benchmark test suite on my computer), depending on the example image(s) used and the size of the output texture. The average speed up seems to be more in the range of 14 - 25% (very unscientifically computed). I assume there are pathological cases where no performance gain could occur, but I think they would be rare.
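To illustrate the shape of the change at the candidate-selection level, here is a hypothetical sketch (names and signatures are placeholders, not the crate's real `better_match` API): the best score found so far is passed into each scoring call, so losing candidates abandon their pixel lookups early.

```rust
// Hypothetical sketch of candidate selection with the best-so-far score fed
// back into scoring. `score_candidate` stands in for the lazy scoring routine
// and returns `None` once a candidate's running score exceeds `current_best`.
fn pick_best_candidate<F>(num_candidates: usize, score_candidate: F) -> Option<usize>
where
    F: Fn(usize, f32) -> Option<f32>,
{
    let mut current_best = f32::INFINITY;
    let mut best_index = None;
    for i in 0..num_candidates {
        // Pass the best score so far; a losing candidate returns early and
        // skips most of its example-pixel reads.
        if let Some(score) = score_candidate(i, current_best) {
            current_best = score;
            best_index = Some(i);
        }
    }
    best_index
}
```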
About my computer:
MacBook Pro (Retina, 13-inch, Early 2015)
Processor: 2.9 GHz Intel Core i5 (4 logical cores)
Memory: 8 GB 1867 MHz DDR3
Edit: Additionally tested with a MacBook Pro 2018
2.6 GHz Intel Core i7 (12 logical cores)
32 GB 2400 MHz DDR4
I also tried to keep the changes minimal, but some refactoring was necessary.
Disclaimer:
I have not tested this on a wide variety of devices or high-end CPUs.
Related Issues
I don't think this is related to any open issues.