-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD-based implementations #109
base: master
Are you sure you want to change the base?
Conversation
68b2c5d
to
b7c30ee
Compare
I also combined the tests, to check all variants directly in correctness, especially for different widths, as they all use the same SIMD implementation. I would postpone the other implementations to a later PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great to see! A couple thoughts:
- The
Implementation
API will likely change, as it's currently a breaking change. See Add non-breaking API for custom implementations #115 for a POC of the new API. It doesn't really affect this PR, but just an FYI that I might refactor it post-merge. - I'd prefer to use
bytemuck
for all the transmutation stuff. The current code is likely safe, but I'd be more comfortable with maintaining fewerunsafe
blocks.
src/crc16/simd.rs
Outdated
table: ( | ||
crc16_table_slice_16(algorithm.width, algorithm.poly, algorithm.refin), | ||
unsafe { | ||
// SAFETY: Both represent numbers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the safety comment provide a bit more info? E.g: "This is safe since [u64; 8] has the same representation as [__m128i; 4]". Also, since this transmute is used in a couple places, I think it's worth moving to x86.rs
with a safe encapsulation. Or just use bytemuck
.
src/crc16/simd.rs
Outdated
return update_slice16(crc, self.algorithm.refin, &self.table.0, bytes); | ||
} | ||
|
||
// SAFETY: Both represent numbers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth using bytemuck
here? It supports SIMD types: https://docs.rs/bytemuck/latest/bytemuck/trait.Pod.html#impl-Pod-for-__m128i.
src/lib.rs
Outdated
))] | ||
impl<W: Width> crate::Implementation for Simd<W> { | ||
type Width = W; | ||
type Table = ([[W; 256]; 16], [simd::SimdValue; 4]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a smaller table for the remaining bytes? Or perhaps no table? I'd love to make the SIMD impl default (based on some compile-time/runtime feature detection), and that's only possible if there's no increase in memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be possible to just use no table (only the 8 64-bit coefficients) as its determined at compile time which implementation to use, but this requires that all crc variants 32-1 bit width are working.
But for now I could also use the normal impl with the 256-entry table + 8 64 bit coefficients. This would also enable doing the feature detection at run-time.
Don't know if its better to do it compile-time and maybe have to store no table at all (only the coeff.) or run-time with maybe a tiny performance impact for the detection and having to store a table at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all crc variants 32-1 bit width are working
I believe that's the case.
But for now I could also use the normal impl with the 256-entry table + 8 64 bit coefficients. This would also enable doing the feature detection at run-time.
That sgtm as well.
Don't know if its better to do it compile-time and maybe have to store no table at all (only the coeff.) or run-time with maybe a tiny performance impact for the detection and having to store a table at all.
Compile-time is fine for now, we can add run-time detection later. Compile-time is useful for embedded x86 where you might not be able to fit the tables in ROM.
src/simd/x86.rs
Outdated
impl SimdValueOps for SimdValue { | ||
#[inline] | ||
#[target_feature(enable = "sse2")] | ||
unsafe fn xor(self, value: u64) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem like safe interfaces. What are the invariants that need to be held to use these functions safely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are safe if the target supports them. Intrinsics are forced to be emitted even if the target doesn't support them, therefore its required to either check at run or compile-time if they are actually supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If SimdValue
is gated at compile-time, then I think a safe abstraction layer might be useful. Somewhat similar to https://docs.rs/safe_arch/latest/safe_arch/struct.m128i.html.
Sorry for the churn, but I just merged the new |
Ah nice, no problem, I'll take a look today, with the feedback given |
…ss tests (as the all test doesn't work for simd because of the amount of data it needs)
I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd? And I would recommend against using Simd as default, as the problem with Simd is, that it's not possible at the moment, to do the crc calculation in a const fn. |
Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl? |
I prefer Simd.
Ah, good point. Then, I think not touching the default behavior is fine. |
I think we should gate the Simd impl behind the platform/target-features and leave the default as is. Once runtime detection is added, this can be revisited. |
Yep, Simd is probably easier to understand. Open for future PRs would be:
can all be done in a non-breaking way. |
…ation and therefore uses the alignment of the destination
The x86 implementation is based on Intel's paper about "Fast CRC Computation for Generic Polynomials using PCLMULQDQ Instruction"
It's about 4 times faster then Slice16, it's implementable for all algorithms, and theoretically requires no table (the remaining bytes could use nolookup). I know that crc32fast exists, but its not configurable, manually calculating the constants is annoying, and this implementation is about 1GB/s faster (crc32fast uses unaligned memory access)
SIMD will only be used when
Crc<Simd<W>>
is used, and supported by the target-features specified when compiling.TODO