Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414
Results of running microbenchmarks on author's machine:

```
$ bench --bench=unfilter --features=benchmarks,unstable -- --baseline=my_baseline filter=Paeth/bpp=3
...
unfilter/filter=Paeth/bpp=3
                        time:   [21.337 µs 21.379 µs 21.429 µs]
                        thrpt:  [546.86 MiB/s 548.14 MiB/s 549.22 MiB/s]
                 change:
                        time:   [-42.023% -41.825% -41.619%] (p = 0.00 < 0.05)
                        thrpt:  [+71.288% +71.895% +72.482%]
                        Performance has improved.
```
/cc @veluca93 |
Regarding the last paragraph on simplification, it appears that the code on the left didn't autovectorize until Rust 1.72. The code on the right is autovectorized from Rust 1.65 onward. |
I just did some benchmarking of these changes on my CPU (a Ryzen 5600X). Overall, I do see substantial improvement in the Paeth unfiltering time (-21% for 3bpp and -38% for 6bpp) which is particularly impactful because it tends to be a bottleneck. However, another thing I've discovered is that there is a severe performance bug in |
Thanks for taking a look and pointing out that the impact of the changes is different on the latest nightly. I've re-tested with
I've tried rerunning the benchmarks with So, I am not sure how to proceed here:
WDYT? I think I'd lean toward Option 1, because bpp=6 seems less common than bpp=3 (I don't have hard data to back that up though) and because the gains for Sub/bpp=6 were relatively small (with default Criterion settings I measured -9%; when collecting x10 samples three times I got: -5.2%, -5.6%, -4.6%). |
I'd say go with option 1. It isn't just that sub filtering with bpp=6 is pretty rare. Another big factor is that because decoding time is split between decompression (which happens at roughly 1 GB/s) and unfiltering, Amdahl's law means that speeding up an already very fast unfiltering method will have only a very small impact on the total decoding time. |
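To make the Amdahl's law argument concrete, here is a minimal sketch. Only the roughly 1 GB/s decompression rate comes from the comment above; the unfiltering throughputs used in the example (10 GB/s improved to 11 GB/s) are made-up illustrative values, and both helper functions are hypothetical.

```rust
/// Seconds needed to decode 1 GB when decompression and unfiltering run
/// one after the other, each at its own throughput (in GB/s).
fn total_time_per_gb(decompress_gbps: f64, unfilter_gbps: f64) -> f64 {
    1.0 / decompress_gbps + 1.0 / unfilter_gbps
}

/// Overall decoding speedup when only the unfiltering stage gets faster.
fn overall_speedup(decompress_gbps: f64, unfilter_before: f64, unfilter_after: f64) -> f64 {
    total_time_per_gb(decompress_gbps, unfilter_before)
        / total_time_per_gb(decompress_gbps, unfilter_after)
}
```

Under these assumed numbers, `overall_speedup(1.0, 10.0, 11.0)` comes out below 1.01: a 10% faster unfilter stage buys less than 1% in total decoding time, because decompression dominates.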
Force-pushed: 788633a to f5021fb
Okay - I switched to option 1 ( FWIW some failures were reported at https://github.com/image-rs/image-png/actions/runs/6330253240:
|
Title changed from "`std::simd` to speed-up `unfilter` for `Paeth`, `Avg`, `Sub` for bpp=3 and bpp=6" to "`std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6"
This change has been suggested by @okaneco in image-rs#414 (comment) - thanks! Co-authored-by: Collyn O'Kane <47607823+okaneco@users.noreply.github.com>
I haven't looked too closely at the algorithm differences, but the intrinsics look like they should produce roughly the same code. More on that under the divider.

One noticeable difference in the Rust version is the generation of constants at the top. They looked like masks, so I tried to see what part of the code was making them. If the load function is changed, then a majority of them disappear, which may help with performance (220 -> 162 lines, 4 fewer movs). There's one more label of data that we might be able to remove (edit: maybe not, it looks like the

```rust
fn load3(src: &[u8]) -> u8x4 {
    // Copy the 3 pixel bytes into a zeroed 4-byte buffer, then load all 4 lanes.
    let mut temp = [0; 4];
    temp[..3].copy_from_slice(&src[..3]);
    u8x4::from_slice(&temp)
}
```

The manual

```rust
fn i16x4_abs(a: i16x4) -> i16x4 {
    let zeros = i16x4::default();
    // All-ones lanes where `a` is negative, zero lanes elsewhere.
    let is_negative = a.simd_lt(zeros).to_int().cast::<u16>();
    // Two's-complement negation of the negative lanes: (a ^ mask) - mask.
    let xored = a.cast::<u16>() ^ is_negative;
    (xored - is_negative).cast::<i16>()
}
```

I wrote out the manual select, but I haven't tested or benchmarked this code. It uses one more packed compare than the

```rust
fn if_then_else(c: i16x4, t: i16x4, e: i16x4) -> i16x4 {
    // Take `t` where the mask `c` is set and `e` where it is clear.
    (c & t) | (!c & e)
}
// if_then_else(
//     smallest.simd_eq(pa).to_int(),
//     a,
//     if_then_else(smallest.simd_eq(pb).to_int(), b, c),
// )
```

When looking at comparisons in godbolt, I use maximize and minimize on the diff view, plus ctrl+f, to get an idea of the differences in instructions. I also use the code preview in the scroll bar to get an idea of instruction count. |
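The two mask tricks discussed in this comment can be checked on a single lane with plain scalar code. The sketch below is not the PR's SIMD code: `abs_via_mask`, the scalar `if_then_else`, and `paeth_select` are hypothetical one-lane stand-ins, assuming a comparison mask is all ones (`-1`) when true and `0` when false, and that inputs are byte values widened to `i16`.

```rust
/// Branchless absolute value for one i16 lane using the xor/subtract trick:
/// `mask` is all ones (0xFFFF) when `a` is negative, otherwise zero.
fn abs_via_mask(a: i16) -> i16 {
    let mask: u16 = if a < 0 { 0xFFFF } else { 0 };
    let xored = (a as u16) ^ mask;
    xored.wrapping_sub(mask) as i16
}

/// Bitwise select: bits of `t` where the mask `c` is set, bits of `e` elsewhere.
fn if_then_else(c: i16, t: i16, e: i16) -> i16 {
    (c & t) | (!c & e)
}

/// Paeth choice among left (`a`), above (`b`) and above-left (`c`),
/// written without branches on the distances. Inputs must be in 0..=255.
fn paeth_select(a: i16, b: i16, c: i16) -> i16 {
    let p = a + b - c;
    let pa = abs_via_mask(p - a);
    let pb = abs_via_mask(p - b);
    let pc = abs_via_mask(p - c);
    let smallest = pa.min(pb).min(pc);
    // Scalar stand-in for a SIMD equality compare that yields a lane mask.
    let eq = |x: i16, y: i16| -> i16 { if x == y { -1 } else { 0 } };
    if_then_else(eq(smallest, pa), a, if_then_else(eq(smallest, pb), b, c))
}
```

Checking `smallest == pa` first and `smallest == pb` second preserves the tie-breaking order of the usual branchy formulation (`a`, then `b`, then `c`).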
Thanks @okaneco! I've tried implementing your proposed changes in 97d4f63 and when measuring them (AMD EPYC 7B12, rustc 1.74.0-nightly (5ae769f06 2023-09-26)) I didn't see an improvement:
So, I think it's okay to stick with the current, simpler code in the PR. |
Thanks for trying that out. I prefer the simpler code as well, and we're probably at the limit of improvement for this area of code without greater architectural changes. I noticed when adding https://rust.godbolt.org/z/6hfzGYxqh (updated The other difference I could spot was that our casting contains unnecessary instructions. I wrongly assumed casting would saturate when narrowing, but it's an
Since it follows In the C++ code, the addition can be done in place by reinterpreting the

```
// L100-102
/* Note `_epi8`: we need addition to wrap modulo 255. */
d = _mm_add_epi8(d, nearest);
store3(row, _mm_packus_epi16(d,d));
```
|
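The difference between a truncating and a saturating narrowing, and the wrapping byte addition that unfiltering needs, can be seen on scalars. This is a sketch with hypothetical helper names; it assumes the SIMD `cast` behaves like a scalar `as` cast, as the comment above suggests, and uses `clamp` to imitate the saturation that `_mm_packus_epi16` performs.

```rust
/// Narrowing with `as` keeps only the low 8 bits (wraps modulo 256).
fn narrow_truncating(x: i16) -> u8 {
    x as u8
}

/// Saturating narrowing clamps to 0..=255 first, like `_mm_packus_epi16`.
fn narrow_saturating(x: i16) -> u8 {
    x.clamp(0, 255) as u8
}

/// Reconstruction adds the predictor to the filtered byte with wrap-around,
/// which is what a byte-wise add such as `_mm_add_epi8` provides.
fn unfilter_byte(filtered: u8, predictor: u8) -> u8 {
    filtered.wrapping_add(predictor)
}
```

For example, 300 narrows to 44 when truncated but to 255 when saturated, so the two are only interchangeable when the value is already known to fit in a byte.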
I guess that in the long-term this may be solved by https://github.com/rust-lang/project-safe-transmute and in the short-term maybe |
Yes, I believe this issue is relevant. A |
Other than the CI issue of enabling the "unstable" feature with the stable compiler, is this ready to merge? |
Results of running microbenchmarks on author's machine:

```
$ bench --bench=unfilter --features=unstable,benchmarks -- --baseline=my_baseline Paeth/bpp=6
...
unfilter/filter=Paeth/bpp=6
                        time:   [22.346 µs 22.356 µs 22.367 µs]
                        thrpt:  [1.0233 GiB/s 1.0238 GiB/s 1.0242 GiB/s]
                 change:
                        time:   [-24.033% -23.941% -23.852%] (p = 0.00 < 0.05)
                        thrpt:  [+31.323% +31.476% +31.637%]
                        Performance has improved.
```
This refactoring is desirable because:

* It removes a little bit of duplication between `unfilter_paeth3` and `unfilter_paeth6`
* It helps in a follow-up CL, where we need to use `paeth_step` from more places.
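As a rough illustration of how one shared step can serve both pixel sizes, here is a scalar sketch. It is not the PR's `std::simd` implementation: the signatures of `paeth_step` and `unfilter_paeth` below are hypothetical, and the per-byte predictor follows the standard PNG definition.

```rust
/// Scalar Paeth predictor: `a` = left, `b` = above, `c` = above-left.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (a as i16, b as i16, c as i16);
    let p = ia + ib - ic;
    let pa = (p - ia).abs();
    let pb = (p - ib).abs();
    let pc = (p - ic).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Hypothetical shared step: reconstructs one BPP-byte pixel `x` in place.
fn paeth_step<const BPP: usize>(x: &mut [u8], a: &[u8; BPP], b: &[u8], c: &[u8; BPP]) {
    for i in 0..BPP {
        x[i] = x[i].wrapping_add(paeth_predictor(a[i], b[i], c[i]));
    }
}

/// Unfilters a whole row; the same code handles BPP = 3 and BPP = 6.
fn unfilter_paeth<const BPP: usize>(prev_row: &[u8], row: &mut [u8]) {
    let mut a = [0u8; BPP]; // left pixel, zero before the first pixel
    let mut c = [0u8; BPP]; // above-left pixel, zero before the first pixel
    for (x, b) in row.chunks_exact_mut(BPP).zip(prev_row.chunks_exact(BPP)) {
        paeth_step::<BPP>(x, &a, b, &c);
        a.copy_from_slice(x);
        c.copy_from_slice(b);
    }
}
```

A caller would pick the pixel size at the call site, for example `unfilter_paeth::<3>(&prev_row, &mut row)` for RGB8 data.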
… or 6). This CL loads RGB data using 4-bytes-wide loads (and RRGGBB data using 8-byte-wide loads), because:

* This is faster as measured by the microbenchmarks below
* It doesn't change the behavior - before and after these changes we were ignoring the 4th SIMD lane when processing RGB data (after this change the 4th SIMD lane will contain data from the next pixel, before this change it contained a 0 value)
* This is safe as long as we have more than 4 bytes of remaining input data (we have to fall back to a 3-bytes-wide load for the last pixel).

Results of running microbenchmarks on the author's machine:

```
$ bench --bench=unfilter --features=unstable,benchmarks -- --baseline=simd1 Paeth/bpp=[36]
...
unfilter/filter=Paeth/bpp=3
                        time:   [18.755 µs 18.761 µs 18.767 µs]
                        thrpt:  [624.44 MiB/s 624.65 MiB/s 624.83 MiB/s]
                 change:
                        time:   [-16.148% -15.964% -15.751%] (p = 0.00 < 0.05)
                        thrpt:  [+18.696% +18.997% +19.258%]
                        Performance has improved.
...
unfilter/filter=Paeth/bpp=6
                        time:   [18.991 µs 19.000 µs 19.009 µs]
                        thrpt:  [1.2041 GiB/s 1.2047 GiB/s 1.2052 GiB/s]
                 change:
                        time:   [-15.161% -15.074% -14.987%] (p = 0.00 < 0.05)
                        thrpt:  [+17.629% +17.750% +17.871%]
                        Performance has improved.
```
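The wide-load idea can be sketched with plain arrays standing in for SIMD vectors. This is an illustration under assumptions, not the PR's code: `load3` here returns a `[u8; 4]` instead of a `u8x4`, the fast path is taken whenever at least 4 bytes remain, and `sum_rgb` is a hypothetical consumer that, like the Paeth code, ignores the 4th lane.

```rust
/// Loads one RGB pixel into 4 lanes. Lane 3 must be ignored by callers:
/// it holds the next pixel's first byte on the fast path, 0 on the fallback.
fn load3(src: &[u8]) -> [u8; 4] {
    let mut lanes = [0u8; 4];
    if src.len() >= 4 {
        // Fast path: a single 4-byte-wide copy.
        lanes.copy_from_slice(&src[..4]);
    } else {
        // Last pixel: only 3 bytes are available, so copy exactly 3.
        lanes[..3].copy_from_slice(&src[..3]);
    }
    lanes
}

/// Sums the R, G and B channels of a row, reading only lanes 0..3 of each load.
fn sum_rgb(row: &[u8]) -> [u32; 3] {
    let mut sums = [0u32; 3];
    for start in (0..row.len() / 3 * 3).step_by(3) {
        let lanes = load3(&row[start..]);
        for ch in 0..3 {
            sums[ch] += lanes[ch] as u32;
        }
    }
    sums
}
```

Because the consumer never reads lane 3, the result is the same whichever path `load3` takes; only the load itself differs.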
Force-pushed: 403d18f to 22295a5
Yes, I think so. |
Quick disclaimer: this PR only helps for RGB8 and RGB16 images which (according to the notes here) account for only 3.9% of PNG images found in top 500 websites. Nevertheless, it's probably still desirable to merge these changes IMO. |
BTW: sorry for being slow with this PR - I didn't realize that I have the power to change the CI configuration through the PR itself. I... err... don't have much experience with GitHub workflows... :-/ |
PTAL?
I hope that we can land this series of 4 commits please. I understand that relying on unstable language/standard-library features means additional maintenance burden, but:
* … `unstable` feature of the `png` crate in the past: 6c0b8fa (i.e. this PR doesn't introduce the `unstable` feature)
* … `unstable` feature.

Do you think it might be desirable to cover the nightly + `unstable`-feature-enabled configuration on CI of the `png` crate? FWIW Chromium uses the nightly compiler (rolled into Chromium toolchain every 1-2 weeks) and can also serve as a canary for detecting breakages (if/once we start depending on the `png` crate - this is still a work-in-progress and we are still evaluating Rust performance against the C/C++ implementation).

Note that I have tried to extend the `std::simd` approach to other bpp cases, but it failed to produce measurable improvements - see: 36b541b

Can you please provide your feedback on whether I should also try to make additional changes to simplify existing `unfilter` code? (@marshallpierce has pointed out that two versions of the code at https://godbolt.org/z/5MvssMncb compile to the same auto-vectorized code.) On one hand, simpler code seems nice (easier to read, less wrapping to fit 100 columns). OTOH, the simplification can be seen as an unnecessary change (and therefore unnecessary risk as auto-vectorization is a bit magical and difficult to test).