Implement SIMD support and add wide integration #278

Merged
merged 3 commits into from
Apr 3, 2022

Ogeon
Owner

@Ogeon Ogeon commented Apr 2, 2022

This adds initial support for SIMD types in most places. An exception is the Luv related types, where the conversion logic needs extra attention. Some of the conversions aren't necessarily optimal, but the focus was on making it work at all.

Integration with the wide crate has been added behind a feature flag, as a first example. More SIMD crates can be added in the future.

Breaking Change

Some functions that used to return bool now return a mask type. This mask type is still bool for regular floats and ints, so this change will mostly affect generic code. GetHue was also changed to no longer return Option<T>, for SIMD friendliness.
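
To illustrate why the mask change mostly affects generic code, here is a minimal sketch of the idea. It is not palette's or wide's actual API: the names `BoolMask`, `LaneCompare`, `F32x4`, `Mask4` and `is_below` are made up for this example, and the 4-lane type is a plain array standing in for a real SIMD vector.

```rust
/// Hypothetical trait for a boolean-like result with one flag per lane.
trait BoolMask: Copy {
    /// True if every lane is set.
    fn all(self) -> bool;
    /// True if at least one lane is set.
    fn any(self) -> bool;
}

/// A plain `bool` is a mask with exactly one lane.
impl BoolMask for bool {
    fn all(self) -> bool {
        self
    }
    fn any(self) -> bool {
        self
    }
}

/// Stand-in for a 4-lane SIMD mask (an assumption, not a real SIMD type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Mask4([bool; 4]);

impl BoolMask for Mask4 {
    fn all(self) -> bool {
        self.0.iter().all(|&lane| lane)
    }
    fn any(self) -> bool {
        self.0.iter().any(|&lane| lane)
    }
}

/// Hypothetical comparison trait: the result type depends on the number type.
trait LaneCompare: Copy {
    type Mask: BoolMask;
    fn less_than(self, other: Self) -> Self::Mask;
}

/// Scalars keep returning `bool`, so non-generic code sees no difference.
impl LaneCompare for f32 {
    type Mask = bool;
    fn less_than(self, other: Self) -> bool {
        self < other
    }
}

/// Stand-in for a 4-lane SIMD float vector.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

impl LaneCompare for F32x4 {
    type Mask = Mask4;
    fn less_than(self, other: Self) -> Mask4 {
        let mut lanes = [false; 4];
        for i in 0..4 {
            lanes[i] = self.0[i] < other.0[i];
        }
        Mask4(lanes)
    }
}

/// Generic code can no longer assume `bool`; it gets `T::Mask` instead.
fn is_below<T: LaneCompare>(value: T, limit: T) -> T::Mask {
    value.less_than(limit)
}
```

With `f32`, `is_below` still yields a `bool` that can be used in an `if`. With the 4-lane type, the caller has to decide how to reduce the mask, for example with `any()` or `all()`.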

@Ogeon Ogeon changed the title Implement SIMD support and add wide integration Implement SIMD support and add wide integration Apr 2, 2022
@github-actions

github-actions bot commented Apr 2, 2022

Benchmark for 780844c

| Test | Base | PR | % |
|---|---|---|---|
| Cie family/lab to lch | 2.9±0.07µs | 2.9±0.08µs | 0.00% |
| Cie family/lab to xyz | 733.0±15.20ns | 732.5±15.26ns | -0.07% |
| Cie family/lch to lab | 2.1±0.05µs | 2.1±0.05µs | 0.00% |
| Cie family/linsrgb to xyz | 3.3±0.06µs | 3.2±0.07µs | -3.03% |
| Cie family/xyz to lab | 16.4±0.32µs | 16.4±0.47µs | 0.00% |
| Cie family/xyz to yxy | 554.9±14.91ns | 473.2±9.12ns | -14.72% |
| Cie family/yxy to xyz | 473.3±16.92ns | 446.1±8.45ns | -5.75% |
| Matrix functions/matrix_inverse | 9.6±0.33ns | 9.3±0.19ns | -3.12% |
| Matrix functions/multiply_3x3 | 12.8±0.26ns | 12.8±0.32ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 5.9±0.14ns | 5.9±0.24ns | 0.00% |
| Matrix functions/multiply_xyz | 5.9±0.25ns | 5.9±0.20ns | 0.00% |
| Matrix functions/multiply_xyz_to_rgb | 5.9±0.15ns | 5.9±0.17ns | 0.00% |
| Matrix functions/rgb_to_xyz_matrix | 20.1±0.38ns | 20.2±0.77ns | +0.50% |
| Rgb family/hsl to hsv | 556.0±17.99ns | 556.6±20.13ns | +0.11% |
| Rgb family/hsl to linear hsl | 8.8±0.17µs | 10.4±0.20µs | +18.18% |
| Rgb family/hsl to rgb | 2.0±0.05µs | 2.1±0.04µs | +5.00% |
| Rgb family/hsv to hsl | 936.2±19.63ns | 1261.8±24.21ns | +34.78% |
| Rgb family/hsv to hwb | 205.4±3.92ns | 205.8±4.61ns | +0.19% |
| Rgb family/hsv to linear hsv | 8.8±0.20µs | 9.9±0.37µs | +12.50% |
| Rgb family/hsv to rgb | 1996.5±52.13ns | 2.0±0.05µs | +0.18% |
| Rgb family/hwb to hsv | 425.7±8.34ns | 425.8±9.23ns | +0.02% |
| Rgb family/hwb to linear hwb | 9.9±0.29µs | 10.4±0.42µs | +5.05% |
| Rgb family/linear hsl to hsl | 10.0±0.40µs | 11.6±0.25µs | +16.00% |
| Rgb family/linear hsv to hsv | 9.0±0.20µs | 11.0±0.32µs | +22.22% |
| Rgb family/linear hwb to hwb | 10.0±0.23µs | 11.6±0.46µs | +16.00% |
| Rgb family/linsrgb to rgb | 5.5±0.13µs | 5.5±0.12µs | 0.00% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.1±0.13µs | 6.1±0.19µs | 0.00% |
| Rgb family/rgb to hsl | 746.6±13.20ns | 1216.8±33.13ns | +62.98% |
| Rgb family/rgb to hsv | 603.3±14.15ns | 1152.6±30.72ns | +91.05% |
| Rgb family/rgb to linsrgb | 5.2±0.12µs | 5.2±0.12µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 5.7±0.12µs | 5.7±0.25µs | 0.00% |
| Rgb family/xyz to linsrgb | 5.0±0.10µs | 5.0±0.23µs | 0.00% |

@github-actions

github-actions bot commented Apr 2, 2022

Benchmark for 7787441

| Test | Base | PR | % |
|---|---|---|---|
| Cie family/lab to lch | 3.3±0.09µs | 3.3±0.05µs | 0.00% |
| Cie family/lab to xyz | 829.1±12.54ns | 829.8±11.34ns | +0.08% |
| Cie family/lch to lab | 2.4±0.04µs | 2.4±0.04µs | 0.00% |
| Cie family/linsrgb to xyz | 3.7±0.06µs | 3.7±0.07µs | 0.00% |
| Cie family/xyz to lab | 18.6±0.41µs | 18.6±0.53µs | 0.00% |
| Cie family/xyz to yxy | 632.6±21.42ns | 534.1±9.35ns | -15.57% |
| Cie family/yxy to xyz | 532.5±8.47ns | 504.6±7.63ns | -5.24% |
| Matrix functions/matrix_inverse | 10.5±0.18ns | 10.5±0.14ns | 0.00% |
| Matrix functions/multiply_3x3 | 14.5±0.37ns | 14.5±0.20ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 6.6±0.12ns | 6.6±0.15ns | 0.00% |
| Matrix functions/multiply_xyz | 6.6±0.11ns | 6.6±0.11ns | 0.00% |
| Matrix functions/multiply_xyz_to_rgb | 6.6±0.12ns | 6.6±0.08ns | 0.00% |
| Matrix functions/rgb_to_xyz_matrix | 22.8±0.42ns | 23.0±1.43ns | +0.88% |
| Rgb family/hsl to hsv | 624.6±8.23ns | 587.1±8.71ns | -6.00% |
| Rgb family/hsl to linear hsl | 10.0±0.15µs | 11.6±0.23µs | +16.00% |
| Rgb family/hsl to rgb | 2.3±0.03µs | 2.4±0.06µs | +4.35% |
| Rgb family/hsv to hsl | 1045.7±19.40ns | 1340.4±31.16ns | +28.18% |
| Rgb family/hsv to hwb | 232.9±5.72ns | 232.4±3.39ns | -0.21% |
| Rgb family/hsv to linear hsv | 10.0±0.20µs | 11.0±0.26µs | +10.00% |
| Rgb family/hsv to rgb | 2.3±0.04µs | 2.3±0.04µs | 0.00% |
| Rgb family/hwb to hsv | 482.8±8.75ns | 482.8±8.13ns | 0.00% |
| Rgb family/hwb to linear hwb | 11.2±0.22µs | 11.5±0.15µs | +2.68% |
| Rgb family/linear hsl to hsl | 11.3±0.20µs | 13.1±0.21µs | +15.93% |
| Rgb family/linear hsv to hsv | 10.2±0.17µs | 12.3±0.58µs | +20.59% |
| Rgb family/linear hwb to hwb | 11.3±0.17µs | 12.9±0.24µs | +14.16% |
| Rgb family/linsrgb to rgb | 6.2±0.08µs | 6.2±0.29µs | 0.00% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.9±0.10µs | 6.9±0.12µs | 0.00% |
| Rgb family/rgb to hsl | 835.5±17.86ns | 1246.2±17.09ns | +49.16% |
| Rgb family/rgb to hsv | 687.4±14.40ns | 1234.3±23.89ns | +79.56% |
| Rgb family/rgb to linsrgb | 6.0±0.14µs | 6.0±0.13µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 6.4±0.09µs | 6.4±0.12µs | 0.00% |
| Rgb family/xyz to linsrgb | 5.6±0.07µs | 5.6±0.08µs | 0.00% |

@github-actions

github-actions bot commented Apr 2, 2022

Benchmark for 50c6381

| Test | Base | PR | % |
|---|---|---|---|
| Cie family/lab to lch | 4.0±0.22µs | 3.9±0.20µs | -2.50% |
| Cie family/lab to xyz | 1015.0±38.42ns | 1008.9±45.37ns | -0.60% |
| Cie family/lch to lab | 2.9±0.28µs | 2.9±0.12µs | 0.00% |
| Cie family/linsrgb to xyz | 4.4±0.13µs | 4.5±0.17µs | +2.27% |
| Cie family/xyz to lab | 22.5±0.72µs | 22.9±1.11µs | +1.78% |
| Cie family/xyz to yxy | 783.3±37.36ns | 652.0±25.90ns | -16.76% |
| Cie family/yxy to xyz | 646.9±19.93ns | 618.8±36.49ns | -4.34% |
| Matrix functions/matrix_inverse | 12.9±0.49ns | 12.9±0.42ns | 0.00% |
| Matrix functions/multiply_3x3 | 17.8±1.08ns | 17.6±0.60ns | -1.12% |
| Matrix functions/multiply_rgb_to_xyz | 8.1±0.30ns | 8.1±0.37ns | 0.00% |
| Matrix functions/multiply_xyz | 8.1±0.49ns | 8.0±0.39ns | -1.23% |
| Matrix functions/multiply_xyz_to_rgb | 8.1±0.34ns | 8.0±0.29ns | -1.23% |
| Matrix functions/rgb_to_xyz_matrix | 27.7±1.32ns | 27.5±1.00ns | -0.72% |
| Rgb family/hsl to hsv | 760.1±30.94ns | 761.8±30.63ns | +0.22% |
| Rgb family/hsl to linear hsl | 12.4±1.19µs | 14.2±0.72µs | +14.52% |
| Rgb family/hsl to rgb | 2.8±0.11µs | 2.9±0.33µs | +3.57% |
| Rgb family/hsv to hsl | 1274.7±48.60ns | 1458.4±60.08ns | +14.41% |
| Rgb family/hsv to hwb | 284.5±14.22ns | 283.2±8.62ns | -0.46% |
| Rgb family/hsv to linear hsv | 12.2±0.50µs | 13.2±0.58µs | +8.20% |
| Rgb family/hsv to rgb | 2.8±0.14µs | 2.7±0.10µs | -3.57% |
| Rgb family/hwb to hsv | 587.8±29.31ns | 763.5±30.92ns | +29.89% |
| Rgb family/hwb to linear hwb | 13.6±0.56µs | 14.2±0.60µs | +4.41% |
| Rgb family/linear hsl to hsl | 13.9±0.53µs | 15.3±1.18µs | +10.07% |
| Rgb family/linear hsv to hsv | 12.6±0.66µs | 15.6±0.66µs | +23.81% |
| Rgb family/linear hwb to hwb | 14.2±0.63µs | 16.4±0.75µs | +15.49% |
| Rgb family/linsrgb to rgb | 7.5±0.25µs | 7.6±0.37µs | +1.33% |
| Rgb family/linsrgb_f32 to rgb_u8 | 8.3±0.23µs | 8.3±0.31µs | 0.00% |
| Rgb family/rgb to hsl | 1037.9±47.12ns | 1528.1±58.44ns | +47.23% |
| Rgb family/rgb to hsv | 830.8±30.53ns | 1523.1±164.19ns | +83.33% |
| Rgb family/rgb to linsrgb | 7.3±0.42µs | 7.3±0.41µs | 0.00% |
| Rgb family/rgb_u8 to linsrgb_f32 | 7.8±0.39µs | 8.0±0.61µs | +2.56% |
| Rgb family/xyz to linsrgb | 6.9±0.32µs | 7.5±0.40µs | +8.70% |

@Ogeon
Owner Author

Ogeon commented Apr 2, 2022

It's a bummer that the RGB to HSL and RGB to HSV conversions are so much slower. I'll try putting the old implementation behind type ID checks (i.e. Great Value Specialization) for now and see if it works better. I should also see if I can add benchmarks for the SIMD versions before merging this.
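
For readers unfamiliar with the type ID trick, here is a rough sketch of the general technique, under the assumption that it means "check at runtime whether the generic type is a known scalar and take a dedicated path if so". This is not the code from the PR: `max_component` and `Path` are hypothetical names, and the example only picks the largest RGB component, which is one step of an RGB to HSV conversion.

```rust
use std::any::Any;

/// Which code path handled the call (only here to make the sketch observable).
#[derive(Debug, PartialEq)]
enum Path {
    ScalarF32,
    Generic,
}

/// Returns the largest of three components, using a dedicated `f32` path
/// when `T` happens to be `f32` and a generic fallback otherwise.
fn max_component<T: Copy + PartialOrd + 'static>(rgb: [T; 3]) -> (T, Path) {
    // `downcast_ref` compares `TypeId`s under the hood, so this branch is
    // only taken when `T` is exactly `f32`.
    if let Some(&[r, g, b]) = (&rgb as &dyn Any).downcast_ref::<[f32; 3]>() {
        let max = r.max(g).max(b);
        // Convert the `f32` result back to `T`. This succeeds whenever the
        // outer check succeeded, because then `T` is `f32`.
        if let Some(&out) = (&max as &dyn Any).downcast_ref::<T>() {
            return (out, Path::ScalarF32);
        }
    }

    // Generic fallback for every other type.
    let mut max = rgb[0];
    for &component in &rgb[1..] {
        if component > max {
            max = component;
        }
    }
    (max, Path::Generic)
}
```

Since the function is monomorphized per type, the type check is a constant for each instance and the optimizer can usually remove the branch that does not apply.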

@github-actions

github-actions bot commented Apr 2, 2022

Benchmark for 48c254f

| Test | Base | PR | % |
|---|---|---|---|
| Cie family/lab to lch | 3.2±0.17µs | 3.1±0.21µs | -3.13% |
| Cie family/lab to xyz | 799.5±45.43ns | 780.7±46.96ns | -2.35% |
| Cie family/lch to lab | 2.3±0.13µs | 2.2±0.13µs | -4.35% |
| Cie family/linsrgb to xyz | 3.5±0.31µs | 3.5±0.25µs | 0.00% |
| Cie family/xyz to lab | 16.9±0.93µs | 18.3±2.05µs | +8.28% |
| Cie family/xyz to yxy | 608.9±34.74ns | 524.6±114.27ns | -13.84% |
| Cie family/yxy to xyz | 511.8±30.11ns | 481.6±28.70ns | -5.90% |
| Matrix functions/matrix_inverse | 9.8±0.63ns | 9.7±0.55ns | -1.02% |
| Matrix functions/multiply_3x3 | 13.5±0.87ns | 13.5±1.42ns | 0.00% |
| Matrix functions/multiply_rgb_to_xyz | 6.3±0.41ns | 6.3±0.40ns | 0.00% |
| Matrix functions/multiply_xyz | 6.1±0.36ns | 5.9±0.27ns | -3.28% |
| Matrix functions/multiply_xyz_to_rgb | 6.3±0.37ns | 6.2±0.38ns | -1.59% |
| Matrix functions/rgb_to_xyz_matrix | 21.1±2.25ns | 21.4±1.78ns | +1.42% |
| Rgb family/hsl to hsv | 580.8±39.71ns | 632.4±41.40ns | +8.88% |
| Rgb family/hsl to linear hsl | 9.3±0.59µs | 10.4±1.43µs | +11.83% |
| Rgb family/hsl to rgb | 2.2±0.43µs | 2.3±0.12µs | +4.55% |
| Rgb family/hsv to hsl | 1005.7±65.28ns | 1218.5±65.26ns | +21.16% |
| Rgb family/hsv to hwb | 218.0±13.57ns | 218.8±25.70ns | +0.37% |
| Rgb family/hsv to linear hsv | 9.8±2.69µs | 9.5±0.87µs | -3.06% |
| Rgb family/hsv to rgb | 2.1±0.13µs | 2.1±0.13µs | 0.00% |
| Rgb family/hwb to hsv | 450.9±31.24ns | 548.0±32.73ns | +21.53% |
| Rgb family/hwb to linear hwb | 10.3±0.63µs | 10.3±0.59µs | 0.00% |
| Rgb family/linear hsl to hsl | 10.7±0.67µs | 10.8±0.94µs | +0.93% |
| Rgb family/linear hsv to hsv | 9.6±0.59µs | 9.8±0.53µs | +2.08% |
| Rgb family/linear hwb to hwb | 10.8±0.65µs | 10.6±1.00µs | -1.85% |
| Rgb family/linsrgb to rgb | 5.8±0.37µs | 5.7±0.35µs | -1.72% |
| Rgb family/linsrgb_f32 to rgb_u8 | 6.4±0.40µs | 6.4±0.41µs | 0.00% |
| Rgb family/rgb to hsl | 820.3±54.94ns | 842.7±57.08ns | +2.73% |
| Rgb family/rgb to hsv | 646.8±37.43ns | 681.2±155.10ns | +5.32% |
| Rgb family/rgb to linsrgb | 5.5±0.30µs | 5.7±0.38µs | +3.64% |
| Rgb family/rgb_u8 to linsrgb_f32 | 5.9±0.35µs | 6.0±0.37µs | +1.69% |
| Rgb family/xyz to linsrgb | 5.3±0.45µs | 5.4±0.32µs | +1.89% |

@Ogeon
Owner Author

Ogeon commented Apr 3, 2022

Looks like the performance gain varies from nothing to several times faster, depending on the work. Converting sRGB to linear RGB is even a bit slower on my machine (possibly due to the powf implementation), converting RGB to HSV or HSL is slightly faster if I use f32x8 but almost equal with f32x4, and converting between XYZ and RGB scales pretty well with the number of lanes. My CPU is not particularly new, though, so YMMV, as always with performance.

I don't think I will go through and optimize everything now. I just wanted to make sure there's any improvement at all.
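
As a rough picture of why the matrix-style conversions can scale with the lane count, here is a small sketch where the same generic arithmetic runs on a scalar or on several lanes at once. It is only an illustration: `F32Lanes` is a hypothetical array-backed stand-in (not a type from wide or palette), `linear_rgb_to_luminance` is a made-up name, and the coefficients are the usual linear sRGB (D65) weights for the Y component of XYZ.

```rust
use std::ops::{Add, Mul};

/// Array-backed stand-in for an N-lane SIMD float vector (an assumption,
/// real SIMD types map these operations to vector instructions).
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32Lanes<const N: usize>([f32; N]);

impl<const N: usize> Add for F32Lanes<N> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for i in 0..N {
            out[i] += rhs.0[i];
        }
        F32Lanes(out)
    }
}

impl<const N: usize> Mul<f32> for F32Lanes<N> {
    type Output = Self;
    fn mul(self, rhs: f32) -> Self {
        F32Lanes(self.0.map(|lane| lane * rhs))
    }
}

/// Computes the Y (luminance) row of a linear sRGB to XYZ conversion.
/// The body has no branches, so it applies unchanged to every lane.
fn linear_rgb_to_luminance<T>(r: T, g: T, b: T) -> T
where
    T: Copy + Add<Output = T> + Mul<f32, Output = T>,
{
    r * 0.2126 + g * 0.7152 + b * 0.0722
}
```

Calling it with `f32` values converts one color, while calling it with `F32Lanes<4>` values converts four colors per call, one per lane. Conversions with per-pixel branching, such as RGB to HSV, don't fit this pattern as directly.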

@Ogeon
Owner Author

Ogeon commented Apr 3, 2022

The benchmark fails because the wide feature isn't on master. But the logs show similar results. And it's pretty cool that it keeps on being feasible to run these benchmarks here!

@Ogeon
Owner Author

Ogeon commented Apr 3, 2022

bors r+

@bors
Contributor

bors bot commented Apr 3, 2022

Build succeeded:

@bors bors bot merged commit 94e3073 into master Apr 3, 2022
@bors bors bot deleted the simd_support branch April 3, 2022 15:13