Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414

anforowicz · 2023-09-25T18:34:34Z

PTAL?

I hope that we can land this series of 4 commits please. I understand that relying on unstable language/standard-library features means additional maintenance burden, but:

There is a precedent for using the unstable feature of the png crate in the past: 6c0b8fa (i.e. this PR doesn't introduce the unstable feature)
The risk of having to deal with breaking changes (which an unstable feature can introduce at any time) only affects library clients that explicitly opt into using the unstable feature.

Do you think it might be desirable to cover the nightly + unstable-feature-enabled configuration on CI of the png crate? FWIW Chromium uses the nightly compiler (rolled into Chromium toolchain every 1-2 weeks) and can also serve as a canary for detecting breakages (if/once we start depending on the png crate - this is still a work-in-progress and we are still evaluating Rust performance against the C/C++ implementation).

Note that I have tried to extend the std::simd approach to other bpp cases, but it failed to produce measurable improvements - see: 36b541b

Can you please provide your feedback on whether I should also try to make additional changes to simplify existing unfilter code? (@marshallpierce has pointed out that two versions of the code at https://godbolt.org/z/5MvssMncb compile to the same auto-vectorized code.) On one hand, simpler code seems nice (easier to read, less wrapping to fit 100 columns). OTOH, the simplification can be seen as an unnecessary change (and therefore unnecessary risk as auto-vectorization is a bit magical and difficult to test).

Results of running microbenchmarks on author's machine: ``` $ bench --bench=unfilter --features=benchmarks,unstable -- --baseline=my_baseline filter=Paeth/bpp=3 ... unfilter/filter=Paeth/bpp=3 time: [21.337 µs 21.379 µs 21.429 µs] thrpt: [546.86 MiB/s 548.14 MiB/s 549.22 MiB/s] change: time: [-42.023% -41.825% -41.619%] (p = 0.00 < 0.05) thrpt: [+71.288% +71.895% +72.482%] Performance has improved. ```

anforowicz · 2023-09-25T18:36:05Z

/cc @veluca93

okaneco · 2023-09-26T00:14:21Z

Regarding the last paragraph on simplification, it appears that the code on the left didn't autovectorize until Rust 1.72. The code on the right is autovectorized from Rust 1.65 onward.
https://godbolt.org/z/MKe4qE4sc (both at rustc 1.71)

fintelia · 2023-09-27T04:28:34Z

I just did some benchmarking of these changes on my CPU (a Ryzen 5600X). Overall, I do see substantial improvement in the Paeth unfiltering time (-21% for 3bpp and -38% for 6bpp) which is particularly impactful because it tends to be a bottleneck.

However, another thing I've discovered is that there is a severe performance bug in rustc 1.72.0 (rust-lang/rust#115212) which was resolved in rust-lang/rust#115236 and backported to rustc 1.72.1. Using a non-buggy compiler, the performance changes for Avg and Sub were quite mixed. Sub/bpp=6 did get 10% faster, but it already ran at nearly 12 GB/s so that's getting into diminishing returns.

anforowicz · 2023-09-27T16:57:24Z

Thanks for taking a look and pointing out that the impact of the changes is different on the latest nightly. I've re-tested with rustc 1.74.0-nightly (5ae769f06 2023-09-26) and I see the following changes in the new unfilter benchmark:

Significant Paeth improvements (time -21% for bpp=3 and -37% for bpp=6)
Small Avg improvement for bpp=3 (time -5%) and a big regression for bpp=6 (time +30%)
Small Sub improvement for bpp=6 (time -9%) and a big regression for bpp=3 (time +80%)

I've tried rerunning the benchmarks with --sample-size=1000 --measurement-time=50 (i.e. collecting 10 times more data) and the Avg/bpp=3 gains disappeared while Sub/bpp=6 gains persisted.

So, I am not sure how to proceed here:

Option 1: Redo/simplify this PR to just focus on Paeth (I think I'll just rebase + force push instead of fixing forward)
Option 2: Use old code for Avg and Sub/bpp=3, but still include the Sub/bpp=6 changes

WDYT? I think I'd lean toward Option 1, because bpp=6 seems less common than bpp=3 (I don't have hard data to back that up though) and because the gains for Sub/bpp=6 were relatively small (with default Criterion settings I measured -9%; when collecting x10 samples three times I got: -5.2%, -5.6%, -4.6%).

fintelia · 2023-09-27T17:33:52Z

I'd say go with option 1. It isn't just that sub filtering with bpp=6 is pretty rare. Another big factor is that because decoding time is split between decompression (which happens at roughly 1 GB/s) and unfiltering, Amdahl's law means that speeding up an already very fast unfiltering method will have only a very small impact on the total decoding time.

anforowicz · 2023-09-27T20:09:23Z

Okay - I switched to option 1 (std::simd-ification of only the Paeth unfilter for bpp=3 and bpp=6). PTAL?

FWIW some failures were reported at https://github.com/image-rs/image-png/actions/runs/6330253240:

mips_cross seems like an unrelated, known issue (based on Scaffolding for direct benchmarking of crate::filter::unfilter. #413 (comment))
test_all failure is introduced by this PR. It seems that the CI needs to be told to only test the unstable crate feature when using the unstable/nightly rustc.

src/filter.rs

@okaneco

This change has been suggested by @okaneco in image-rs#414 (comment) - thanks! Co-authored-by: Collyn O'Kane <47607823+okaneco@users.noreply.github.com>

okaneco · 2023-10-06T00:58:55Z

Thanks! I have trouble reading and following the disassembly, so I really appreciate your feedback here.

(BTW: Have you seen the comparison of C++ vs Rust disasembly posted in #416: https://godbolt.org/z/j8fK39oEE ?)

I haven't looked too closely at the algorithm differences, the intrinsics look like they should produce roughly the same code. More on that under the divider.

One noticeable difference in the Rust version is the generation of constants at the top. They looked like masks, so I tried to see what part of the code was making that. If the load function is changed, then a majority of them disappear which may help with performance (220 -> 162 lines, 4 less movs). There's one more label of data that we might be able to remove (edit: maybe not, it looks like the u8 cast for the predictor). I haven't tested or ran this code but it looks equivalent.
https://rust.godbolt.org/z/YzWMbdsco

fn load3(src: &[u8]) -> u8x4 {
    let mut temp = [0; 4];
    temp[..3].copy_from_slice(&src[..3]);
    u8x4::from_slice(&temp)
}

The manual abs produces the same assembly output in Rust. https://rust.godbolt.org/z/36jTz7e5d

fn i16x4_abs(mut a: i16x4) -> i16x4 {
    let zeros = i16x4::default();
    let is_negative = a.simd_lt(zeros).to_int().cast::<u16>();
    let xored = a.cast::<u16>() ^ is_negative;
    (xored - is_negative).cast::<i16>()
}

I wrote out the manual select, but I haven't tested or benchmarked this code. It uses one more packed compare than the select. https://rust.godbolt.org/z/cbYK5xchf

fn if_then_else(c: i16x4, t: i16x4, e: i16x4) -> i16x4 {
    (c & t) | (c & !e)
}
//     if_then_else(
//         smallest.simd_eq(pa).to_int(),
//         a,
//         if_then_else(smallest.simd_eq(pb).to_int(), b, c),
//     )

When looking at comparisons in godbolt, I make use of the maximize and minimize on the diff view and ctrl+f to get an idea of the differences in instructions. I also use the code preview in the scroll bar to get an idea of instruction count.

anforowicz · 2023-10-06T18:27:00Z

Thanks @okaneco! I've tried implementing your proposed changes in 97d4f63 and when measuring them (AMD EPYC 7B12, rustc 1.74.0-nightly (5ae769f06 2023-09-26)) I didn't see an improvement:

$ taskset --cpu-list 4-7 nice -n -19 target/release/deps/unfilter-a5b9c1f23e358474 --bench --baseline=simd1 Paeth/bpp=[36]
...
unfilter/filter=Paeth/bpp=3
                        time:   [18.635 µs 18.658 µs 18.677 µs]
                        thrpt:  [627.44 MiB/s 628.09 MiB/s 628.84 MiB/s]
                 change:
                        time:   [-0.8646% -0.5573% -0.2935%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2944% +0.5604% +0.8721%]
                        Change within noise threshold.
Found 30 outliers among 100 measurements (30.00%)
  15 (15.00%) low severe
  8 (8.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [18.867 µs 18.875 µs 18.883 µs]
                        thrpt:  [1.2121 GiB/s 1.2126 GiB/s 1.2131 GiB/s]
                 change:
                        time:   [-0.3220% -0.1147% +0.0969%] (p = 0.30 > 0.05)
                        thrpt:  [-0.0968% +0.1148% +0.3230%]
                        No change in performance detected.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  6 (6.00%) high severe


lukasza2 in /usr/local/google/home/lukasza/src/github/image-png
$ taskset --cpu-list 4-7 nice -n -19 target/release/deps/unfilter-a5b9c1f23e358474 --bench --baseline=simd1 Paeth/bpp=[36]
...
unfilter/filter=Paeth/bpp=3
                        time:   [18.696 µs 18.699 µs 18.702 µs]
                        thrpt:  [626.59 MiB/s 626.71 MiB/s 626.81 MiB/s]
                 change:
                        time:   [+0.0355% +0.0812% +0.1249%] (p = 0.00 < 0.05)
                        thrpt:  [-0.1248% -0.0812% -0.0354%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

unfilter/filter=Paeth/bpp=6
                        time:   [18.874 µs 18.879 µs 18.885 µs]
                        thrpt:  [1.2120 GiB/s 1.2124 GiB/s 1.2127 GiB/s]
                 change:
                        time:   [-0.1953% -0.0306% +0.1590%] (p = 0.74 > 0.05)
                        thrpt:  [-0.1587% +0.0306% +0.1957%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

So, I think it's okay to stick with the current, simpler code in the PR.

okaneco · 2023-10-06T20:41:01Z

Thanks for trying that out. I prefer the simpler code as well, and we're probably at the limit of improvement for this area of code without greater architectural changes.

I noticed when adding if_then_else to the full unfilter program that it produced 2 more instructions. So even though the extra data was removed, the change to load3 is probably perf neutral-to-slightly-negative.

https://rust.godbolt.org/z/6hfzGYxqh (updated load3 on the left, if_then_else on right)

The other difference I could spot was that our casting contains unnecessary instructions. I wrongfully assumed casting would saturate when narrowing, but it's an as cast:

This follows the semantics of Rust’s as conversion for casting integers (wrapping to other integer types, and saturating to float types).
https://doc.rust-lang.org/stable/std/simd/trait.SimdInt.html#tymethod.cast

Since it follows as-cast semantics, the conversion looks like (i16 & 255) as u8. This results in the code having that label of 8 .short 255 and calling pand to clip to u8 before calling packus and padd. This probably isn't a huge bottleneck but it's unnecessary in our case (and seemingly unavoidable while using Simd casts) since we know the values are originally bytes.

In the C++ code, the addition can be done in place by reinterpreting the __m128i, previously treated as i16, as a vector of u8 with _mm_add_epi8. Then to store, they call the saturating pack instrinsic which narrows from i16 to u8. It'd be nice if we could safely reinterpret the Simd structs as if we were using intrinsics (such as going from i16x4 to u8x8 to perform wrapping addition), but I'm not aware of a way.

      // L100-102
      /* Note `_epi8`: we need addition to wrap modulo 255. */
      d = _mm_add_epi8(d, nearest);
      store3(row, _mm_packus_epi16(d,d));

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=packus&techs=SSE_ALL&ig_expand=4903

anforowicz · 2023-10-06T21:14:01Z

In the C++ code, the addition can be done in place by reinterpreting the __m128i, previously treated as i16, as a vector of u8 with _mm_add_epi8. Then to store, they call the saturating pack instrinsic which narrows from i16 to u8. It'd be nice if we could safely reinterpret the Simd structs as if we were using intrinsics (such as going from i16x4 to u8x8 to perform wrapping addition), but I'm not aware of a way.

I guess that in the long-term this may be solved by https://github.com/rust-lang/project-safe-transmute and in the short-term maybe bytemuck (or std::simd) can declare that Simd is AnyBitPattern + NoUninit?

okaneco · 2023-10-06T21:51:13Z

Yes, I believe this issue is relevant.
rust-lang/portable-simd#116

A to_bits/from_bits would be nice as mentioned in the discussion. The PR #368 in that repo might be making to_bytes features public but I'm not sure.

src/filter.rs

fintelia · 2023-10-09T17:14:19Z

Other than the CI issue of enabling the "unstable" feature with the stable compiler, is this ready to merge?

Results of running microbenchmarks on author's machine: ``` $ bench --bench=unfilter --features=unstable,benchmarks -- --baseline=my_baseline Paeth/bpp=6 ... unfilter/filter=Paeth/bpp=6 time: [22.346 µs 22.356 µs 22.367 µs] thrpt: [1.0233 GiB/s 1.0238 GiB/s 1.0242 GiB/s] change: time: [-24.033% -23.941% -23.852%] (p = 0.00 < 0.05) thrpt: [+31.323% +31.476% +31.637%] Performance has improved. ```

This refactoring is desirable because: * It removes a little bit of duplication between `unfilter_paeth3` and `unfilter_paeth6` * It helps in a follow-up CL, where we need to use `paeth_step` from more places.

… or 6). This CL loads RGB data using 4-bytes-wide loads (and RRGGBB data using 8-byte-wide loads), because: * This is faster as measured by the microbenchmarks below * It doesn't change the behavior - before and after these changes we were ignoring the 4th SIMD lane when processing RGB data (after this change the 4th SIMD lane will contain data from the next pixel, before this change it contained a 0 value) * This is safe as long as we have more than 4 bytes of remaining input data (we have to fall back to a 3-bytes-wide load for the last pixel). Results of running microbenchmarks on the author's machine: ``` $ bench --bench=unfilter --features=unstable,benchmarks -- --baseline=simd1 Paeth/bpp=[36] ... unfilter/filter=Paeth/bpp=3 time: [18.755 µs 18.761 µs 18.767 µs] thrpt: [624.44 MiB/s 624.65 MiB/s 624.83 MiB/s] change: time: [-16.148% -15.964% -15.751%] (p = 0.00 < 0.05) thrpt: [+18.696% +18.997% +19.258%] Performance has improved. ... unfilter/filter=Paeth/bpp=6 time: [18.991 µs 19.000 µs 19.009 µs] thrpt: [1.2041 GiB/s 1.2047 GiB/s 1.2052 GiB/s] change: time: [-15.161% -15.074% -14.987%] (p = 0.00 < 0.05) thrpt: [+17.629% +17.750% +17.871%] Performance has improved. ```

anforowicz · 2023-10-09T18:10:08Z

Other than the CI issue of enabling the "unstable" feature with the stable compiler, is this ready to merge?

Yes, I think so.

anforowicz · 2023-11-01T20:37:31Z

Quick disclaimer: this PR only helps for RGB8 and RGB16 images which (according to the notes here) account for only 3.9% of PNG images found in top 500 websites. Nevertheless, it's probably still desirable to merge these changes IMO.

anforowicz · 2023-11-01T20:39:15Z

BTW: sorry for being slow with this PR - I didn't realize that I have the power to change the CI configuration through the PR itself . I... err... don't have much experience with github workflows... :-/

anforowicz force-pushed the unfilter-6bpp-improvements branch from 788633a to f5021fb Compare September 27, 2023 18:57

anforowicz changed the title ~~Using std::simd to speed-up unfilter for Paeth, Avg, Sub for bpp=3 and bpp=6~~ Using std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 Sep 29, 2023

anforowicz mentioned this pull request Sep 29, 2023

Discussion: PNG decoding performance improvement opportunities #415

Closed

okaneco reviewed Sep 30, 2023

View reviewed changes

src/filter.rs Show resolved Hide resolved

fintelia reviewed Oct 9, 2023

View reviewed changes

src/filter.rs Outdated Show resolved Hide resolved

anforowicz added 3 commits October 9, 2023 18:07

Extract a separate struct PaethState and fn paeth_step.

63222f6

This refactoring is desirable because: * It removes a little bit of duplication between `unfilter_paeth3` and `unfilter_paeth6` * It helps in a follow-up CL, where we need to use `paeth_step` from more places.

anforowicz force-pushed the unfilter-6bpp-improvements branch from 403d18f to 22295a5 Compare October 9, 2023 18:09

Tweak CI to avoid testing the unstable feature with stable rustc

c9a7327

fintelia merged commit 1825c7e into image-rs:master Nov 2, 2023
18 of 19 checks passed

anforowicz deleted the unfilter-6bpp-improvements branch November 2, 2023 01:34

fintelia mentioned this pull request Dec 19, 2023

Question: Using SIMD in Image? image-rs/image#2077

Closed

anforowicz mentioned this pull request Feb 1, 2024

Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414

Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414

anforowicz commented Sep 25, 2023

anforowicz commented Sep 25, 2023

okaneco commented Sep 26, 2023

fintelia commented Sep 27, 2023 •

edited

Loading

anforowicz commented Sep 27, 2023

fintelia commented Sep 27, 2023

anforowicz commented Sep 27, 2023

okaneco commented Oct 6, 2023 •

edited

Loading

anforowicz commented Oct 6, 2023

okaneco commented Oct 6, 2023 •

edited

Loading

anforowicz commented Oct 6, 2023

okaneco commented Oct 6, 2023

fintelia commented Oct 9, 2023

anforowicz commented Oct 9, 2023

anforowicz commented Nov 1, 2023

anforowicz commented Nov 1, 2023

Using std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 #414

Using std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 #414

Conversation

anforowicz commented Sep 25, 2023

anforowicz commented Sep 25, 2023

okaneco commented Sep 26, 2023

fintelia commented Sep 27, 2023 • edited Loading

anforowicz commented Sep 27, 2023

fintelia commented Sep 27, 2023

anforowicz commented Sep 27, 2023

okaneco commented Oct 6, 2023 • edited Loading

anforowicz commented Oct 6, 2023

okaneco commented Oct 6, 2023 • edited Loading

anforowicz commented Oct 6, 2023

okaneco commented Oct 6, 2023

fintelia commented Oct 9, 2023

anforowicz commented Oct 9, 2023

anforowicz commented Nov 1, 2023

anforowicz commented Nov 1, 2023

Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414

Using `std::simd` to speed-up `unfilter` for `Paeth` for bpp=3 and bpp=6 #414

fintelia commented Sep 27, 2023 •

edited

Loading

okaneco commented Oct 6, 2023 •

edited

Loading

okaneco commented Oct 6, 2023 •

edited

Loading