SIMD-based implementations #109

valaphee · 2023-11-11T21:29:03Z

The x86 implementation is based on Intel's paper "Fast CRC Computation for Generic Polynomials using PCLMULQDQ Instruction".

baseline/baseline       time:   [563.00 ns 564.20 ns 565.65 ns]
                        thrpt:  [26.976 GiB/s 27.045 GiB/s 27.103 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
crc32/default           time:   [31.258 µs 31.264 µs 31.269 µs]
                        thrpt:  [499.70 MiB/s 499.78 MiB/s 499.87 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
crc32/nolookup          time:   [131.55 µs 131.57 µs 131.60 µs]
                        thrpt:  [118.73 MiB/s 118.76 MiB/s 118.78 MiB/s]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe
crc32/bytewise          time:   [31.266 µs 31.271 µs 31.276 µs]
                        thrpt:  [499.59 MiB/s 499.67 MiB/s 499.74 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
crc32/slice16           time:   [4.6168 µs 4.6195 µs 4.6223 µs]
                        thrpt:  [3.3011 GiB/s 3.3031 GiB/s 3.3050 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
crc32/simd              time:   [983.88 ns 985.53 ns 987.60 ns]
                        thrpt:  [15.450 GiB/s 15.483 GiB/s 15.509 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

It's about 4.7 times faster than Slice16 (15.5 GiB/s versus 3.3 GiB/s in the run above), it can be implemented for all algorithms, and in theory it requires no table (the remaining bytes could use nolookup). I know that crc32fast exists, but it's not configurable, calculating the constants by hand is annoying, and this implementation is about 1 GB/s faster (crc32fast uses unaligned memory access).

SIMD is only used when Crc<Simd<W>> is selected and the target features specified at compile time support it.
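As a rough illustration of what "the remaining bytes could use nolookup" means, here is a minimal table-free, bit-at-a-time update for a reflected 32-bit CRC. It is a sketch and not the code from this PR: the function names are made up for the example, and CRC-32/ISO-HDLC is assumed as the algorithm.

```rust
/// Reflected form of the CRC-32/ISO-HDLC polynomial 0x04C11DB7.
const POLY_REFLECTED: u32 = 0xEDB8_8320;

/// Feeds `bytes` into a running reflected CRC-32 state one bit at a time.
/// No lookup table is involved (hypothetical helper, not the crate's API).
pub fn update_nolookup(mut crc: u32, bytes: &[u8]) -> u32 {
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            // Shift one bit out and conditionally XOR in the polynomial.
            crc = (crc >> 1) ^ (POLY_REFLECTED & mask);
        }
    }
    crc
}

/// CRC-32/ISO-HDLC of a whole buffer: init 0xFFFFFFFF, final XOR 0xFFFFFFFF.
pub fn crc32_iso_hdlc(bytes: &[u8]) -> u32 {
    !update_nolookup(!0, bytes)
}
```

Because the state is threaded through `update_nolookup`, a SIMD loop could process the bulk of the input and hand its state to this function for the few trailing bytes.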

TODO

valaphee · 2024-03-23T00:59:37Z

I also combined the tests so that the correctness tests check all variants directly, especially for different widths, since they all share the same SIMD implementation.

I would postpone the other implementations to a later PR.

akhilles

This is great to see! A couple thoughts:

The Implementation API will likely change, as it's currently a breaking change. See "Add non-breaking API for custom implementations" (#115) for a PoC of the new API. It doesn't really affect this PR, but just an FYI that I might refactor it post-merge.
I'd prefer to use bytemuck for all the transmutation stuff. The current code is likely safe, but I'd be more comfortable with maintaining fewer unsafe blocks.

akhilles · 2024-03-30T23:26:28Z

src/crc16/simd.rs

+            table: (
+                crc16_table_slice_16(algorithm.width, algorithm.poly, algorithm.refin),
+                unsafe {
+                    // SAFETY: Both represent numbers.


Could the safety comment provide a bit more info? E.g: "This is safe since [u64; 8] has the same representation as [__m128i; 4]". Also, since this transmute is used in a couple places, I think it's worth moving to x86.rs with a safe encapsulation. Or just use bytemuck.
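A safe encapsulation of that reinterpretation could look roughly like the sketch below. The names `Lane`, `coeffs_to_lanes` and `lanes_to_coeffs` are invented for the example and do not come from the PR. On non-x86_64 targets `u128` is used as a stand-in so the snippet still compiles.

```rust
use core::mem::transmute;

/// 128-bit lane type: the real SSE register type on x86_64, a plain u128
/// stand-in elsewhere (assumption made only to keep the sketch portable).
#[cfg(target_arch = "x86_64")]
pub type Lane = core::arch::x86_64::__m128i;
#[cfg(not(target_arch = "x86_64"))]
pub type Lane = u128;

/// Reinterprets eight 64-bit coefficients as four 128-bit lanes.
pub fn coeffs_to_lanes(coeffs: [u64; 8]) -> [Lane; 4] {
    // SAFETY: [u64; 8] and [Lane; 4] are both exactly 64 bytes, neither has
    // padding, and every bit pattern is a valid value of both types. The
    // transmute is by value, so the stricter alignment of `Lane` is
    // satisfied by the destination.
    unsafe { transmute::<[u64; 8], [Lane; 4]>(coeffs) }
}

/// Inverse of `coeffs_to_lanes`.
pub fn lanes_to_coeffs(lanes: [Lane; 4]) -> [u64; 8] {
    // SAFETY: same size and validity argument as above, in reverse.
    unsafe { transmute::<[Lane; 4], [u64; 8]>(lanes) }
}
```

Callers then never write `unsafe` themselves, and the size equality is checked by the compiler at every `transmute`.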

akhilles · 2024-03-30T23:30:50Z

src/crc16/simd.rs

+            return update_slice16(crc, self.algorithm.refin, &self.table.0, bytes);
+        }
+
+        // SAFETY: Both represent numbers.


Is it worth using bytemuck here? It supports SIMD types: https://docs.rs/bytemuck/latest/bytemuck/trait.Pod.html#impl-Pod-for-__m128i.

akhilles · 2024-03-30T23:39:20Z

src/lib.rs

+))]
+impl<W: Width> crate::Implementation for Simd<W> {
+    type Width = W;
+    type Table = ([[W; 256]; 16], [simd::SimdValue; 4]);


Can we use a smaller table for the remaining bytes? Or perhaps no table? I'd love to make the SIMD impl default (based on some compile-time/runtime feature detection), and that's only possible if there's no increase in memory usage.

valaphee · Apr 1, 2024

It should be possible to use no table at all (only the 8 64-bit coefficients), since it's determined at compile time which implementation to use, but this requires that all CRC variants of 1 to 32 bit width are working.

But for now I could also use the normal impl with the 256-entry table + 8 64-bit coefficients. This would also enable doing the feature detection at run-time.

I don't know if it's better to do it at compile-time and maybe have to store no table at all (only the coefficients), or at run-time with maybe a tiny performance impact for the detection and having to store a table.

akhilles · Apr 1, 2024

> all CRC variants of 1 to 32 bit width are working

I believe that's the case.

> But for now I could also use the normal impl with the 256-entry table + 8 64-bit coefficients. This would also enable doing the feature detection at run-time.

That sgtm as well.

> I don't know if it's better to do it at compile-time and maybe have to store no table at all (only the coefficients), or at run-time with maybe a tiny performance impact for the detection and having to store a table.

Compile-time is fine for now; we can add run-time detection later. Compile-time is useful for embedded x86 where you might not be able to fit the tables in ROM.
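To put numbers on the memory trade-off discussed in this thread, the sizes of the three candidates can be computed at compile time. This is a small sketch for a 32-bit width; the constant names are made up for the example.

```rust
use core::mem::size_of;

/// Slice-by-16 table for a 32-bit CRC, as in `[[W; 256]; 16]` with W = u32.
pub const SLICE16_TABLE_BYTES: usize = size_of::<[[u32; 256]; 16]>();

/// Ordinary 256-entry table used by the bytewise implementation.
pub const BYTEWISE_TABLE_BYTES: usize = size_of::<[u32; 256]>();

/// The eight 64-bit folding coefficients on their own.
pub const COEFFICIENT_BYTES: usize = size_of::<[u64; 8]>();
```

For `u32` this comes to 16384, 1024 and 64 bytes respectively, which is why dropping the slice-by-16 table matters if the SIMD implementation is ever to become the default.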

akhilles · 2024-03-30T23:42:47Z

src/simd/x86.rs

+impl SimdValueOps for SimdValue {
+    #[inline]
+    #[target_feature(enable = "sse2")]
+    unsafe fn xor(self, value: u64) -> Self {


These seem like safe interfaces. What are the invariants that need to be held to use these functions safely?

valaphee · Apr 1, 2024

They are safe if the target supports them. Intrinsics are emitted even if the target doesn't support them, so it's required to check at either run time or compile time that they are actually supported.

akhilles · Apr 1, 2024

If SimdValue is gated at compile-time, then I think a safe abstraction layer might be useful. Somewhat similar to https://docs.rs/safe_arch/latest/safe_arch/struct.m128i.html.
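One way such a compile-time-gated safe layer might look is sketched below. The function `xor_low` is hypothetical and only loosely mirrors the `xor` operation in the diff. The intrinsic path is compiled only when SSE2 is statically enabled, and a scalar fallback keeps the function available everywhere else.

```rust
/// XORs `value` into the low 64 bits of a 128-bit lane given as `[low, high]`.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
pub fn xor_low(lane: [u64; 2], value: u64) -> [u64; 2] {
    use core::arch::x86_64::{__m128i, _mm_set_epi64x, _mm_xor_si128};
    use core::mem::transmute;
    // SAFETY: this function only exists when SSE2 is enabled for the whole
    // build (see the cfg above), so the intrinsics are supported wherever
    // this code can run. [u64; 2] and __m128i are both 16 bytes and accept
    // every bit pattern, so the by-value transmutes are sound.
    unsafe {
        let a: __m128i = transmute(lane);
        // The second argument of _mm_set_epi64x is the low 64-bit element.
        let b = _mm_set_epi64x(0, value as i64);
        transmute(_mm_xor_si128(a, b))
    }
}

/// Scalar fallback with the same signature for all other builds.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
pub fn xor_low(lane: [u64; 2], value: u64) -> [u64; 2] {
    [lane[0] ^ value, lane[1]]
}
```

The invariant ("the feature is enabled") is discharged once by the `cfg`, so the public signature no longer needs to be `unsafe`.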

akhilles · 2024-04-02T05:58:10Z

Sorry for the churn, but I just merged the new Implementation API (#115). There shouldn't be too many conflicts with this PR, but I can help rebase it if you'd like.

valaphee · 2024-04-02T18:07:18Z

Ah nice, no problem. I'll take a look today and address the feedback given.

valaphee · 2024-04-03T09:57:31Z

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

And I would recommend against making Simd the default, because with Simd it's not possible at the moment to do the CRC calculation in a const fn.

valaphee · 2024-04-03T09:59:12Z

Should Simd/Clmul always be available, even if the platform/target-features don't support the required features, and then just be handled as a type alias for the default impl?

akhilles · 2024-04-04T15:56:37Z

> I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

I prefer Simd.

> And I would recommend against making Simd the default, because with Simd it's not possible at the moment to do the CRC calculation in a const fn.

Ah, good point. Then, I think not touching the default behavior is fine.
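For contrast, a table-driven CRC can be evaluated entirely at compile time, which the intrinsic-based path cannot do at the moment. The sketch below is independent of the crate's actual code: `make_table`, `checksum` and `CHECK` are names invented for the example, and CRC-32/ISO-HDLC is assumed.

```rust
/// Builds the 256-entry table for the reflected polynomial 0xEDB88320.
const fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut crc = i as u32;
        let mut bit = 0;
        while bit < 8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
            bit += 1;
        }
        table[i] = crc;
        i += 1;
    }
    table
}

const TABLE: [u32; 256] = make_table();

/// Bytewise CRC-32/ISO-HDLC that is usable in const contexts.
pub const fn checksum(bytes: &[u8]) -> u32 {
    let mut crc = !0u32;
    let mut i = 0;
    while i < bytes.len() {
        crc = (crc >> 8) ^ TABLE[((crc ^ bytes[i] as u32) & 0xFF) as usize];
        i += 1;
    }
    !crc
}

/// Evaluated by the compiler, not at run time.
pub const CHECK: u32 = checksum(b"123456789");
```

Keeping the default implementation on a path like this preserves `const` use, while Simd stays opt-in.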

akhilles · 2024-04-04T15:59:23Z

> Should Simd/Clmul always be available, even if the platform/target-features don't support the required features, and then just be handled as a type alias for the default impl?

I think we should gate the Simd impl behind the platform/target-features and leave the default as is.

Once runtime detection is added, this can be revisited.

valaphee · 2024-04-05T16:54:31Z

Yep, Simd is probably easier to understand.

Left open for future PRs:

ARM (pmull2) support
CRC1..32 non-reflected support
CRC33..128 support
Runtime feature detection (requires std or https://github.com/RustCrypto/utils/tree/master/cpufeatures)

These can all be done in a non-breaking way.
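With std available, the runtime detection item could be approached along these lines. This is only a sketch: `Backend`, `clmul_available` and `select_backend` are hypothetical names, and the exact feature set the SIMD path needs is an assumption (PCLMULQDQ plus SSE2 here).

```rust
/// Which implementation to dispatch to (hypothetical enum for the sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Simd,
    Table,
}

/// True if the running CPU reports the features assumed for the SIMD path.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn clmul_available() -> bool {
    std::arch::is_x86_feature_detected!("pclmulqdq")
        && std::arch::is_x86_feature_detected!("sse2")
}

/// Other architectures have no x86 carry-less multiply to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn clmul_available() -> bool {
    false
}

/// Picks the backend once; callers can cache the result.
pub fn select_backend() -> Backend {
    if clmul_available() {
        Backend::Simd
    } else {
        Backend::Table
    }
}
```

The trade-off raised earlier in the thread still applies: with runtime selection the table-based fallback has to be present in the binary even on machines that end up using SIMD.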

valaphee changed the title ~~PoC: SIMD-based implementation~~ Draft PoC: SIMD-based implementation Nov 11, 2023

valaphee marked this pull request as draft December 8, 2023 17:20

valaphee changed the title ~~Draft PoC: SIMD-based implementation~~ SIMD-based implementations Dec 8, 2023

valaphee mentioned this pull request Dec 9, 2023

Make it possible to customize CRC polynomial and order srijs/rust-crc32fast#9

Open

valaphee force-pushed the simd branch 4 times, most recently from 68b2c5d to b7c30ee Compare March 23, 2024 00:49

valaphee marked this pull request as ready for review March 23, 2024 00:50

valaphee force-pushed the simd branch from b7c30ee to 3df2166 Compare March 23, 2024 01:14

akhilles reviewed Mar 30, 2024

Implement clmul for crc8, 16, 32, test for all cases in the correctness tests (as the all test doesn't work for simd because of the amount of data it needs)

79e1c40

valaphee force-pushed the simd branch from bf350b5 to 79e1c40 Compare April 3, 2024 09:56

Rename clmul to simd

677a3b3

Moving mask to const, eliminating transmutes and ensuring const evaluation and therefore uses the alignment of the destination

82b4ac8

valaphee force-pushed the simd branch from ac7673c to 82b4ac8 Compare April 5, 2024 17:16

Fix doc comment

07acda3

valaphee requested a review from akhilles April 15, 2024 15:31
