Implement SIMD support and add `wide` integration #278

Ogeon · 2022-04-02T14:04:15Z

This adds initial support for SIMD types in most places. An exception is the Luv-related types, where the conversion logic needs extra attention. Some of the conversions aren't necessarily optimal, but the focus was on making it work at all.

Integration with the wide crate has been added behind a feature flag, as a first example. More SIMD crates can be added in the future.

Breaking Change

Some functions that used to return `bool` now return a mask type. This mask type is still `bool` for regular floats and ints, so this change will mostly affect generic code. `GetHue` was also changed to no longer return `Option<T>`, for SIMD friendliness.
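To illustrate what the mask change means for generic code, here is a minimal sketch. The trait `LessThan`, the toy `F32x4` type and the function `is_darker` are hypothetical names made up for this example, not the crate's actual API. The point is only that the comparison result becomes an associated type, which stays `bool` for scalars.

```rust
/// Hypothetical comparison trait: the result type depends on the operand type.
trait LessThan: Copy {
    /// `bool` for scalars, one flag per lane for SIMD-like types.
    type Mask;

    fn lt_mask(self, other: Self) -> Self::Mask;
}

impl LessThan for f32 {
    type Mask = bool;

    fn lt_mask(self, other: Self) -> bool {
        self < other
    }
}

/// Toy four-lane vector, standing in for a real SIMD type (assumption).
#[derive(Clone, Copy)]
struct F32x4([f32; 4]);

impl LessThan for F32x4 {
    type Mask = [bool; 4];

    fn lt_mask(self, other: Self) -> [bool; 4] {
        let mut mask = [false; 4];
        for lane in 0..4 {
            mask[lane] = self.0[lane] < other.0[lane];
        }
        mask
    }
}

/// Generic code can no longer assume a `bool` result; it names `T::Mask` instead.
fn is_darker<T: LessThan>(value: T, limit: T) -> T::Mask {
    value.lt_mask(limit)
}
```

Code that only ever uses `f32` or `f64` would keep seeing plain `bool` values in a design like this, which is why the break should mostly show up in generic signatures.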

github-actions · 2022-04-02T14:25:09Z

Benchmark for `780844c`

Test	Base	PR	%
Cie family/lab to lch	2.9±0.07µs	2.9±0.08µs	0.00%
Cie family/lab to xyz	733.0±15.20ns	732.5±15.26ns	-0.07%
Cie family/lch to lab	2.1±0.05µs	2.1±0.05µs	0.00%
Cie family/linsrgb to xyz	3.3±0.06µs	3.2±0.07µs	-3.03%
Cie family/xyz to lab	16.4±0.32µs	16.4±0.47µs	0.00%
Cie family/xyz to yxy	554.9±14.91ns	473.2±9.12ns	-14.72%
Cie family/yxy to xyz	473.3±16.92ns	446.1±8.45ns	-5.75%
Matrix functions/matrix_inverse	9.6±0.33ns	9.3±0.19ns	-3.12%
Matrix functions/multiply_3x3	12.8±0.26ns	12.8±0.32ns	0.00%
Matrix functions/multiply_rgb_to_xyz	5.9±0.14ns	5.9±0.24ns	0.00%
Matrix functions/multiply_xyz	5.9±0.25ns	5.9±0.20ns	0.00%
Matrix functions/multiply_xyz_to_rgb	5.9±0.15ns	5.9±0.17ns	0.00%
Matrix functions/rgb_to_xyz_matrix	20.1±0.38ns	20.2±0.77ns	+0.50%
Rgb family/hsl to hsv	556.0±17.99ns	556.6±20.13ns	+0.11%
Rgb family/hsl to linear hsl	8.8±0.17µs	10.4±0.20µs	+18.18%
Rgb family/hsl to rgb	2.0±0.05µs	2.1±0.04µs	+5.00%
Rgb family/hsv to hsl	936.2±19.63ns	1261.8±24.21ns	+34.78%
Rgb family/hsv to hwb	205.4±3.92ns	205.8±4.61ns	+0.19%
Rgb family/hsv to linear hsv	8.8±0.20µs	9.9±0.37µs	+12.50%
Rgb family/hsv to rgb	1996.5±52.13ns	2.0±0.05µs	+0.18%
Rgb family/hwb to hsv	425.7±8.34ns	425.8±9.23ns	+0.02%
Rgb family/hwb to linear hwb	9.9±0.29µs	10.4±0.42µs	+5.05%
Rgb family/linear hsl to hsl	10.0±0.40µs	11.6±0.25µs	+16.00%
Rgb family/linear hsv to hsv	9.0±0.20µs	11.0±0.32µs	+22.22%
Rgb family/linear hwb to hwb	10.0±0.23µs	11.6±0.46µs	+16.00%
Rgb family/linsrgb to rgb	5.5±0.13µs	5.5±0.12µs	0.00%
Rgb family/linsrgb_f32 to rgb_u8	6.1±0.13µs	6.1±0.19µs	0.00%
Rgb family/rgb to hsl	746.6±13.20ns	1216.8±33.13ns	+62.98%
Rgb family/rgb to hsv	603.3±14.15ns	1152.6±30.72ns	+91.05%
Rgb family/rgb to linsrgb	5.2±0.12µs	5.2±0.12µs	0.00%
Rgb family/rgb_u8 to linsrgb_f32	5.7±0.12µs	5.7±0.25µs	0.00%
Rgb family/xyz to linsrgb	5.0±0.10µs	5.0±0.23µs	0.00%

github-actions · 2022-04-02T15:10:18Z

Benchmark for `7787441`

Test	Base	PR	%
Cie family/lab to lch	3.3±0.09µs	3.3±0.05µs	0.00%
Cie family/lab to xyz	829.1±12.54ns	829.8±11.34ns	+0.08%
Cie family/lch to lab	2.4±0.04µs	2.4±0.04µs	0.00%
Cie family/linsrgb to xyz	3.7±0.06µs	3.7±0.07µs	0.00%
Cie family/xyz to lab	18.6±0.41µs	18.6±0.53µs	0.00%
Cie family/xyz to yxy	632.6±21.42ns	534.1±9.35ns	-15.57%
Cie family/yxy to xyz	532.5±8.47ns	504.6±7.63ns	-5.24%
Matrix functions/matrix_inverse	10.5±0.18ns	10.5±0.14ns	0.00%
Matrix functions/multiply_3x3	14.5±0.37ns	14.5±0.20ns	0.00%
Matrix functions/multiply_rgb_to_xyz	6.6±0.12ns	6.6±0.15ns	0.00%
Matrix functions/multiply_xyz	6.6±0.11ns	6.6±0.11ns	0.00%
Matrix functions/multiply_xyz_to_rgb	6.6±0.12ns	6.6±0.08ns	0.00%
Matrix functions/rgb_to_xyz_matrix	22.8±0.42ns	23.0±1.43ns	+0.88%
Rgb family/hsl to hsv	624.6±8.23ns	587.1±8.71ns	-6.00%
Rgb family/hsl to linear hsl	10.0±0.15µs	11.6±0.23µs	+16.00%
Rgb family/hsl to rgb	2.3±0.03µs	2.4±0.06µs	+4.35%
Rgb family/hsv to hsl	1045.7±19.40ns	1340.4±31.16ns	+28.18%
Rgb family/hsv to hwb	232.9±5.72ns	232.4±3.39ns	-0.21%
Rgb family/hsv to linear hsv	10.0±0.20µs	11.0±0.26µs	+10.00%
Rgb family/hsv to rgb	2.3±0.04µs	2.3±0.04µs	0.00%
Rgb family/hwb to hsv	482.8±8.75ns	482.8±8.13ns	0.00%
Rgb family/hwb to linear hwb	11.2±0.22µs	11.5±0.15µs	+2.68%
Rgb family/linear hsl to hsl	11.3±0.20µs	13.1±0.21µs	+15.93%
Rgb family/linear hsv to hsv	10.2±0.17µs	12.3±0.58µs	+20.59%
Rgb family/linear hwb to hwb	11.3±0.17µs	12.9±0.24µs	+14.16%
Rgb family/linsrgb to rgb	6.2±0.08µs	6.2±0.29µs	0.00%
Rgb family/linsrgb_f32 to rgb_u8	6.9±0.10µs	6.9±0.12µs	0.00%
Rgb family/rgb to hsl	835.5±17.86ns	1246.2±17.09ns	+49.16%
Rgb family/rgb to hsv	687.4±14.40ns	1234.3±23.89ns	+79.56%
Rgb family/rgb to linsrgb	6.0±0.14µs	6.0±0.13µs	0.00%
Rgb family/rgb_u8 to linsrgb_f32	6.4±0.09µs	6.4±0.12µs	0.00%
Rgb family/xyz to linsrgb	5.6±0.07µs	5.6±0.08µs	0.00%

github-actions · 2022-04-02T15:46:32Z

Benchmark for `50c6381`

Test	Base	PR	%
Cie family/lab to lch	4.0±0.22µs	3.9±0.20µs	-2.50%
Cie family/lab to xyz	1015.0±38.42ns	1008.9±45.37ns	-0.60%
Cie family/lch to lab	2.9±0.28µs	2.9±0.12µs	0.00%
Cie family/linsrgb to xyz	4.4±0.13µs	4.5±0.17µs	+2.27%
Cie family/xyz to lab	22.5±0.72µs	22.9±1.11µs	+1.78%
Cie family/xyz to yxy	783.3±37.36ns	652.0±25.90ns	-16.76%
Cie family/yxy to xyz	646.9±19.93ns	618.8±36.49ns	-4.34%
Matrix functions/matrix_inverse	12.9±0.49ns	12.9±0.42ns	0.00%
Matrix functions/multiply_3x3	17.8±1.08ns	17.6±0.60ns	-1.12%
Matrix functions/multiply_rgb_to_xyz	8.1±0.30ns	8.1±0.37ns	0.00%
Matrix functions/multiply_xyz	8.1±0.49ns	8.0±0.39ns	-1.23%
Matrix functions/multiply_xyz_to_rgb	8.1±0.34ns	8.0±0.29ns	-1.23%
Matrix functions/rgb_to_xyz_matrix	27.7±1.32ns	27.5±1.00ns	-0.72%
Rgb family/hsl to hsv	760.1±30.94ns	761.8±30.63ns	+0.22%
Rgb family/hsl to linear hsl	12.4±1.19µs	14.2±0.72µs	+14.52%
Rgb family/hsl to rgb	2.8±0.11µs	2.9±0.33µs	+3.57%
Rgb family/hsv to hsl	1274.7±48.60ns	1458.4±60.08ns	+14.41%
Rgb family/hsv to hwb	284.5±14.22ns	283.2±8.62ns	-0.46%
Rgb family/hsv to linear hsv	12.2±0.50µs	13.2±0.58µs	+8.20%
Rgb family/hsv to rgb	2.8±0.14µs	2.7±0.10µs	-3.57%
Rgb family/hwb to hsv	587.8±29.31ns	763.5±30.92ns	+29.89%
Rgb family/hwb to linear hwb	13.6±0.56µs	14.2±0.60µs	+4.41%
Rgb family/linear hsl to hsl	13.9±0.53µs	15.3±1.18µs	+10.07%
Rgb family/linear hsv to hsv	12.6±0.66µs	15.6±0.66µs	+23.81%
Rgb family/linear hwb to hwb	14.2±0.63µs	16.4±0.75µs	+15.49%
Rgb family/linsrgb to rgb	7.5±0.25µs	7.6±0.37µs	+1.33%
Rgb family/linsrgb_f32 to rgb_u8	8.3±0.23µs	8.3±0.31µs	0.00%
Rgb family/rgb to hsl	1037.9±47.12ns	1528.1±58.44ns	+47.23%
Rgb family/rgb to hsv	830.8±30.53ns	1523.1±164.19ns	+83.33%
Rgb family/rgb to linsrgb	7.3±0.42µs	7.3±0.41µs	0.00%
Rgb family/rgb_u8 to linsrgb_f32	7.8±0.39µs	8.0±0.61µs	+2.56%
Rgb family/xyz to linsrgb	6.9±0.32µs	7.5±0.40µs	+8.70%

Ogeon · 2022-04-02T16:56:30Z

It's a bummer that the RGB to HSL and RGB to HSV conversions are so much slower. I'll try putting the old implementation behind type ID checks (i.e. Great Value Specialization) for now and see if it works better. I should also see if I can add benchmarks for the SIMD versions before merging this.
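For readers unfamiliar with the trick, the type ID approach can be sketched roughly as below. This is not the code from this PR: `SelectMax`, `Path` and `max_of_three` are hypothetical names, and the sketch uses `Any::downcast_ref` (which compares type IDs internally) so that it stays in safe Rust. When `T` turns out to be `f32`, the branchy scalar code runs; every other type takes the generic, branch-free path.

```rust
use std::any::Any;

/// Hypothetical trait for the branch-free path that every supported type provides.
trait SelectMax: Copy + 'static {
    fn select_max(self, other: Self) -> Self;
}

impl SelectMax for f32 {
    fn select_max(self, other: Self) -> Self {
        f32::max(self, other)
    }
}

impl SelectMax for [f32; 4] {
    fn select_max(self, other: Self) -> Self {
        let mut out = self;
        for lane in 0..4 {
            out[lane] = f32::max(self[lane], other[lane]);
        }
        out
    }
}

/// Reports which implementation ran, purely for illustration.
#[derive(Debug, PartialEq)]
enum Path {
    Scalar,
    Generic,
}

fn max_of_three<T: SelectMax>(r: T, g: T, b: T) -> (T, Path) {
    // "Specialization" by type check: the downcasts only succeed when `T` is `f32`.
    if let (Some(&r), Some(&g), Some(&b)) = (
        (&r as &dyn Any).downcast_ref::<f32>(),
        (&g as &dyn Any).downcast_ref::<f32>(),
        (&b as &dyn Any).downcast_ref::<f32>(),
    ) {
        // Old-style scalar code with ordinary branches.
        let max = if r > g {
            if r > b { r } else { b }
        } else if g > b {
            g
        } else {
            b
        };
        // Convert back to `T`; this cannot fail because `T` is `f32` here.
        let out = *(&max as &dyn Any)
            .downcast_ref::<T>()
            .expect("T is f32 in this branch");
        return (out, Path::Scalar);
    }

    // Generic path, usable for SIMD-like types.
    (r.select_max(g).select_max(b), Path::Generic)
}
```

The type check is on a type known at compile time, so after monomorphization the compiler can usually remove the branch that doesn't apply, though that is an optimizer detail rather than a guarantee.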

github-actions · 2022-04-02T17:13:50Z

Benchmark for `48c254f`

Test	Base	PR	%
Cie family/lab to lch	3.2±0.17µs	3.1±0.21µs	-3.13%
Cie family/lab to xyz	799.5±45.43ns	780.7±46.96ns	-2.35%
Cie family/lch to lab	2.3±0.13µs	2.2±0.13µs	-4.35%
Cie family/linsrgb to xyz	3.5±0.31µs	3.5±0.25µs	0.00%
Cie family/xyz to lab	16.9±0.93µs	18.3±2.05µs	+8.28%
Cie family/xyz to yxy	608.9±34.74ns	524.6±114.27ns	-13.84%
Cie family/yxy to xyz	511.8±30.11ns	481.6±28.70ns	-5.90%
Matrix functions/matrix_inverse	9.8±0.63ns	9.7±0.55ns	-1.02%
Matrix functions/multiply_3x3	13.5±0.87ns	13.5±1.42ns	0.00%
Matrix functions/multiply_rgb_to_xyz	6.3±0.41ns	6.3±0.40ns	0.00%
Matrix functions/multiply_xyz	6.1±0.36ns	5.9±0.27ns	-3.28%
Matrix functions/multiply_xyz_to_rgb	6.3±0.37ns	6.2±0.38ns	-1.59%
Matrix functions/rgb_to_xyz_matrix	21.1±2.25ns	21.4±1.78ns	+1.42%
Rgb family/hsl to hsv	580.8±39.71ns	632.4±41.40ns	+8.88%
Rgb family/hsl to linear hsl	9.3±0.59µs	10.4±1.43µs	+11.83%
Rgb family/hsl to rgb	2.2±0.43µs	2.3±0.12µs	+4.55%
Rgb family/hsv to hsl	1005.7±65.28ns	1218.5±65.26ns	+21.16%
Rgb family/hsv to hwb	218.0±13.57ns	218.8±25.70ns	+0.37%
Rgb family/hsv to linear hsv	9.8±2.69µs	9.5±0.87µs	-3.06%
Rgb family/hsv to rgb	2.1±0.13µs	2.1±0.13µs	0.00%
Rgb family/hwb to hsv	450.9±31.24ns	548.0±32.73ns	+21.53%
Rgb family/hwb to linear hwb	10.3±0.63µs	10.3±0.59µs	0.00%
Rgb family/linear hsl to hsl	10.7±0.67µs	10.8±0.94µs	+0.93%
Rgb family/linear hsv to hsv	9.6±0.59µs	9.8±0.53µs	+2.08%
Rgb family/linear hwb to hwb	10.8±0.65µs	10.6±1.00µs	-1.85%
Rgb family/linsrgb to rgb	5.8±0.37µs	5.7±0.35µs	-1.72%
Rgb family/linsrgb_f32 to rgb_u8	6.4±0.40µs	6.4±0.41µs	0.00%
Rgb family/rgb to hsl	820.3±54.94ns	842.7±57.08ns	+2.73%
Rgb family/rgb to hsv	646.8±37.43ns	681.2±155.10ns	+5.32%
Rgb family/rgb to linsrgb	5.5±0.30µs	5.7±0.38µs	+3.64%
Rgb family/rgb_u8 to linsrgb_f32	5.9±0.35µs	6.0±0.37µs	+1.69%
Rgb family/xyz to linsrgb	5.3±0.45µs	5.4±0.32µs	+1.89%

Ogeon · 2022-04-03T14:27:19Z

Looks like the performance gain varies from nothing to several times faster, depending on the work. Converting sRGB to linear RGB is even a bit slower on my machine (possibly due to the `powf` implementation), converting RGB to HSV or HSL is slightly faster if I use `f32x8` but almost equal with `f32x4`, and converting between XYZ and RGB scales pretty well with the number of lanes. My CPU is not particularly new, though, so YMMV, as always with performance.
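As a rough picture of why the matrix-based conversions scale with the lane count: they are only multiplications and additions, so the same generic code can process one color or several at once. The sketch below is an illustration under assumptions, not the crate's implementation. `F32x4` is a toy array-backed type rather than a real SIMD vector, `Splat` and `multiply_3x3` are hypothetical names, and the matrix in the usage is a placeholder rather than a real RGB to XYZ matrix.

```rust
use std::ops::{Add, Mul};

/// Toy four-lane vector, standing in for a real SIMD type (assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl Add for F32x4 {
    type Output = Self;

    fn add(self, other: Self) -> Self {
        let mut out = self.0;
        for lane in 0..4 {
            out[lane] += other.0[lane];
        }
        F32x4(out)
    }
}

impl Mul for F32x4 {
    type Output = Self;

    fn mul(self, other: Self) -> Self {
        let mut out = self.0;
        for lane in 0..4 {
            out[lane] *= other.0[lane];
        }
        F32x4(out)
    }
}

/// Hypothetical helper trait: broadcast one scalar coefficient to every lane.
trait Splat: Copy {
    fn splat(value: f32) -> Self;
}

impl Splat for f32 {
    fn splat(value: f32) -> Self {
        value
    }
}

impl Splat for F32x4 {
    fn splat(value: f32) -> Self {
        F32x4([value; 4])
    }
}

/// Row-major 3x3 matrix times a three-component color, generic over the lane type.
fn multiply_3x3<T>(matrix: &[f32; 9], [x, y, z]: [T; 3]) -> [T; 3]
where
    T: Splat + Add<Output = T> + Mul<Output = T>,
{
    let m = |i: usize| T::splat(matrix[i]);
    [
        m(0) * x + m(1) * y + m(2) * z,
        m(3) * x + m(4) * y + m(5) * z,
        m(6) * x + m(7) * y + m(8) * z,
    ]
}
```

With `T = f32` this converts one color per call, and with a four-lane type it converts four colors per call using the same number of generic operations. Branch-heavy conversions such as RGB to HSV don't map onto lanes as directly, which would be consistent with the smaller gains reported above.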

I don't think I will go through and optimize everything now. Just making sure there's any improvement at all.

Ogeon · 2022-04-03T14:40:42Z

The benchmark fails because the wide feature isn't on master. But the logs show similar results. And it's pretty cool that it keeps on being feasible to run these benchmarks here!

Ogeon · 2022-04-03T15:02:40Z

bors r+

bors · 2022-04-03T15:13:55Z

Build succeeded:

ci

Ogeon changed the title ~~Implement SIMD support and add wide integration~~ Implement SIMD support and add `wide` integration Apr 2, 2022

Implement SIMD support and add wide integration

d9e41a4

Ogeon force-pushed the simd_support branch from be2f215 to d9e41a4 Compare April 2, 2022 14:15

Ogeon force-pushed the simd_support branch from 0f9ee6f to ad67628 Compare April 2, 2022 15:21

Try to improve performance for RGB -> HSV and RGB -> HSL

2252fed

Ogeon force-pushed the simd_support branch from ad67628 to 2252fed Compare April 2, 2022 16:53

Add a few behchmars with wide and speed up a couple of cases

23a4ed0

Ogeon force-pushed the simd_support branch from d4b5824 to 23a4ed0 Compare April 3, 2022 14:38

bors bot merged commit 94e3073 into master Apr 3, 2022

bors bot deleted the simd_support branch April 3, 2022 15:13
