SIMD-based implementations #109

Open: wants to merge 4 commits into base: master

Conversation


@valaphee valaphee commented Nov 11, 2023

The x86 implementation is based on Intel's paper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction".

baseline/baseline       time:   [563.00 ns 564.20 ns 565.65 ns]
                        thrpt:  [26.976 GiB/s 27.045 GiB/s 27.103 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
crc32/default           time:   [31.258 µs 31.264 µs 31.269 µs]
                        thrpt:  [499.70 MiB/s 499.78 MiB/s 499.87 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
crc32/nolookup          time:   [131.55 µs 131.57 µs 131.60 µs]
                        thrpt:  [118.73 MiB/s 118.76 MiB/s 118.78 MiB/s]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe
crc32/bytewise          time:   [31.266 µs 31.271 µs 31.276 µs]
                        thrpt:  [499.59 MiB/s 499.67 MiB/s 499.74 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
crc32/slice16           time:   [4.6168 µs 4.6195 µs 4.6223 µs]
                        thrpt:  [3.3011 GiB/s 3.3031 GiB/s 3.3050 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
crc32/simd              time:   [983.88 ns 985.53 ns 987.60 ns]
                        thrpt:  [15.450 GiB/s 15.483 GiB/s 15.509 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

It's about 4.7 times faster than Slice16 in the run above (15.5 GiB/s vs. 3.3 GiB/s), it's implementable for all algorithms, and it theoretically requires no table (the remaining bytes could use nolookup). I know that crc32fast exists, but it's not configurable, manually calculating the constants is annoying, and this implementation is about 1 GB/s faster (crc32fast uses unaligned memory access).
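For readers unfamiliar with the nolookup variant mentioned above, here is a minimal sketch of a table-free, bit-at-a-time update. It assumes the reflected CRC-32/ISO-HDLC parameters (polynomial 0xEDB88320, initial value and final XOR of all ones); the function names are illustrative and are not the crate's API.

```rust
/// Reflected form of the CRC-32/ISO-HDLC polynomial (assumed for this sketch).
const POLY_REFLECTED: u32 = 0xEDB8_8320;

/// Table-free update: processes one bit per inner iteration.
/// `crc` is the running (already inverted) state, so calls can be chained.
fn crc32_nolookup_update(mut crc: u32, bytes: &[u8]) -> u32 {
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (POLY_REFLECTED & mask);
        }
    }
    crc
}

/// One-shot checksum: applies the initial value and the final XOR.
fn crc32_nolookup(bytes: &[u8]) -> u32 {
    !crc32_nolookup_update(!0, bytes)
}
```

Because the update is chainable, a bulk pass over whole blocks could in principle hand its intermediate state to a routine like this for the remaining bytes, at the cost of the slow per-bit loop on the tail only.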

SIMD is only used when Crc<Simd<W>> is selected and the required target features are enabled at compile time.

TODO

  • 8 Normal domain
  • 16 Normal domain
  • 32 Normal domain
  • 64 Normal domain
  • 8 Bit-reflected domain
  • 16 Bit-reflected domain
  • 32 Bit-reflected domain
  • 64 Bit-reflected domain
  • x86
  • arm (NEON)

@valaphee valaphee changed the title PoC: SIMD-based implementation Draft PoC: SIMD-based implementation Nov 11, 2023
@valaphee valaphee marked this pull request as draft December 8, 2023 17:20
@valaphee valaphee changed the title Draft PoC: SIMD-based implementation SIMD-based implementations Dec 8, 2023
@valaphee valaphee force-pushed the simd branch 4 times, most recently from 68b2c5d to b7c30ee Compare March 23, 2024 00:49
@valaphee valaphee marked this pull request as ready for review March 23, 2024 00:50
@valaphee
Author

valaphee commented Mar 23, 2024

I also combined the tests so that every variant is checked for correctness directly, especially across the different widths, since they all use the same SIMD implementation.

I would postpone the other implementations to a later PR.

Collaborator

@akhilles akhilles left a comment

This is great to see! A couple thoughts:

  • The Implementation API will likely change, as it's currently a breaking change. See "Add non-breaking API for custom implementations" (#115) for a POC of the new API. It doesn't really affect this PR; this is just an FYI that I might refactor it post-merge.
  • I'd prefer to use bytemuck for all the transmutation stuff. The current code is likely safe, but I'd be more comfortable with maintaining fewer unsafe blocks.

table: (
crc16_table_slice_16(algorithm.width, algorithm.poly, algorithm.refin),
unsafe {
// SAFETY: Both represent numbers.
Collaborator

Could the safety comment provide a bit more info? E.g.: "This is safe since [u64; 8] has the same representation as [__m128i; 4]". Also, since this transmute is used in a couple of places, I think it's worth moving it to x86.rs with a safe encapsulation. Or just use bytemuck.
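One way the suggested safe encapsulation could look is sketched below. The function names and the `Lane` alias are hypothetical and are not taken from the PR. On non-x86_64 targets the sketch substitutes `u128` as a stand-in so that it still compiles.

```rust
// On x86_64 the lane type is the real SSE register type; elsewhere a
// same-sized stand-in is used purely so this sketch compiles (assumption).
#[cfg(target_arch = "x86_64")]
type Lane = core::arch::x86_64::__m128i;
#[cfg(not(target_arch = "x86_64"))]
type Lane = u128;

/// Safe wrapper: reinterprets eight 64-bit coefficients as four 128-bit lanes.
fn coefficients_to_lanes(coefficients: [u64; 8]) -> [Lane; 4] {
    // SAFETY: `[u64; 8]` and `[Lane; 4]` are both 64 bytes with no padding,
    // and every bit pattern is valid for both types. The transmute is by
    // value, so the differing alignment requirements do not matter.
    unsafe { core::mem::transmute::<[u64; 8], [Lane; 4]>(coefficients) }
}

/// Inverse of `coefficients_to_lanes`, useful for tests and debugging.
fn lanes_to_coefficients(lanes: [Lane; 4]) -> [u64; 8] {
    // SAFETY: same size and validity argument as above, in reverse.
    unsafe { core::mem::transmute::<[Lane; 4], [u64; 8]>(lanes) }
}
```

Keeping the `unsafe` in one place means the safety argument is written and reviewed once, and callers only see safe functions. `core::mem::transmute` also rejects mismatched sizes at compile time.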

return update_slice16(crc, self.algorithm.refin, &self.table.0, bytes);
}

// SAFETY: Both represent numbers.
Collaborator

src/lib.rs Outdated
))]
impl<W: Width> crate::Implementation for Simd<W> {
type Width = W;
type Table = ([[W; 256]; 16], [simd::SimdValue; 4]);
Collaborator

Can we use a smaller table for the remaining bytes? Or perhaps no table? I'd love to make the SIMD impl default (based on some compile-time/runtime feature detection), and that's only possible if there's no increase in memory usage.

Author

It should be possible to use no table at all (only the 8 64-bit coefficients), as it's determined at compile time which implementation to use, but this requires that all crc variants 32-1 bit width are working.

But for now I could also use the normal impl with the 256-entry table + 8 64 bit coefficients. This would also enable doing the feature detection at run-time.

Don't know if it's better to do it at compile time and maybe have to store no table at all (only the coeff.), or at run time with maybe a tiny performance impact for the detection and having to store a table at all.

Collaborator

> all crc variants 32-1 bit width are working

I believe that's the case.

> But for now I could also use the normal impl with the 256-entry table + 8 64 bit coefficients. This would also enable doing the feature detection at run-time.

That sgtm as well.

> Don't know if it's better to do it at compile time and maybe have to store no table at all (only the coeff.), or at run time with maybe a tiny performance impact for the detection and having to store a table at all.

Compile-time is fine for now, we can add run-time detection later. Compile-time is useful for embedded x86 where you might not be able to fit the tables in ROM.
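The two detection strategies discussed in this thread can be contrasted with a short sketch. The function names are hypothetical. The `cfg!` macro and `std::arch::is_x86_feature_detected!` are standard Rust, but how the PR actually gates its implementation is not shown here.

```rust
/// Compile-time check: true only if the binary was built with the
/// features enabled (e.g. via `-C target-feature=+pclmulqdq`).
const fn clmul_enabled_at_compile_time() -> bool {
    cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "pclmulqdq",
        target_feature = "sse2"
    ))
}

/// Run-time check on x86: asks the CPU the program is running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn clmul_available_at_run_time() -> bool {
    std::arch::is_x86_feature_detected!("pclmulqdq")
        && std::arch::is_x86_feature_detected!("sse2")
}

/// On other architectures this sketch simply reports "not available".
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn clmul_available_at_run_time() -> bool {
    false
}

/// Illustrative dispatcher: a run-time choice needs the table fallback to
/// be present in the binary, whereas a compile-time choice can omit it.
fn pick_backend() -> &'static str {
    if clmul_enabled_at_compile_time() || clmul_available_at_run_time() {
        "simd"
    } else {
        "table"
    }
}
```

This is the trade-off raised above: a purely compile-time gate lets a build drop the lookup table entirely, while run-time detection must keep a fallback and pays a small check per dispatch.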

src/simd/x86.rs Outdated
impl SimdValueOps for SimdValue {
#[inline]
#[target_feature(enable = "sse2")]
unsafe fn xor(self, value: u64) -> Self {
Collaborator

These seem like safe interfaces. What are the invariants that need to be held to use these functions safely?

Author

They are safe if the target supports them. Intrinsics are forced to be emitted even if the target doesn't support them, so it's required to check either at run time or at compile time whether they are actually supported.

Collaborator

If SimdValue is gated at compile-time, then I think a safe abstraction layer might be useful. Somewhat similar to https://docs.rs/safe_arch/latest/safe_arch/struct.m128i.html.
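A rough sketch of such a safe abstraction layer follows, assuming the type is only compiled when SSE2 is statically enabled. The method set (`new`, `xor`, `low`) is illustrative and differs from the PR's `SimdValueOps` trait. The second module is a portable stand-in, included only so the sketch builds on other targets.

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
mod safe_simd {
    use core::arch::x86_64::{__m128i, _mm_cvtsi128_si64, _mm_set_epi64x, _mm_xor_si128};

    /// Newtype whose existence proves SSE2 is statically available.
    #[derive(Clone, Copy)]
    pub struct SimdValue(__m128i);

    impl SimdValue {
        pub fn new(high: u64, low: u64) -> Self {
            // SAFETY: this module only compiles when SSE2 is enabled for
            // the whole build, so the instruction is always supported.
            Self(unsafe { _mm_set_epi64x(high as i64, low as i64) })
        }

        pub fn xor(self, other: Self) -> Self {
            // SAFETY: as above, SSE2 is guaranteed by the cfg gate.
            Self(unsafe { _mm_xor_si128(self.0, other.0) })
        }

        /// Returns the low 64 bits of the register.
        pub fn low(self) -> u64 {
            // SAFETY: as above, SSE2 is guaranteed by the cfg gate.
            unsafe { _mm_cvtsi128_si64(self.0) as u64 }
        }
    }
}

// Portable stand-in with the same interface (assumption, for illustration).
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
mod safe_simd {
    #[derive(Clone, Copy)]
    pub struct SimdValue(u128);

    impl SimdValue {
        pub fn new(high: u64, low: u64) -> Self {
            Self((u128::from(high) << 64) | u128::from(low))
        }
        pub fn xor(self, other: Self) -> Self {
            Self(self.0 ^ other.0)
        }
        pub fn low(self) -> u64 {
            self.0 as u64
        }
    }
}
```

With the gate on the module rather than on each function, callers never write `unsafe`, and the only invariant (the target supports the instructions) is enforced by the compiler configuration.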

@akhilles
Collaborator

akhilles commented Apr 2, 2024

Sorry for the churn, but I just merged the new Implementation API (#115). There shouldn't be too many conflicts with this PR, but I can help rebase it if you'd like.

@valaphee
Author

valaphee commented Apr 2, 2024

Ah nice, no problem, I'll take a look today, with the feedback given

…ss tests (as the all test doesn't work for simd because of the amount of data it needs)
@valaphee
Author

valaphee commented Apr 3, 2024

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

And I would recommend against using Simd as the default: at the moment it's not possible to do the CRC calculation in a const fn.
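For context on the name Clmul: carry-less multiplication is polynomial multiplication over GF(2), where partial products are combined with XOR instead of addition. The following software sketch shows the operation that the PCLMULQDQ instruction performs in hardware on 64-bit operands. The function name is illustrative.

```rust
/// Carry-less (polynomial, GF(2)) multiplication of two 64-bit values.
/// Each set bit of `b` contributes a shifted copy of `a`; the copies are
/// combined with XOR, so no carries propagate between bit positions.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut product = 0u128;
    for bit in 0..64 {
        if (b >> bit) & 1 == 1 {
            product ^= u128::from(a) << bit;
        }
    }
    product
}
```

For example, `clmul64(3, 3)` is 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.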

@valaphee
Author

valaphee commented Apr 3, 2024

Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

@akhilles
Collaborator

akhilles commented Apr 4, 2024

> I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

I prefer Simd.

> And I would recommend against using Simd as the default: at the moment it's not possible to do the CRC calculation in a const fn.

Ah, good point. Then, I think not touching the default behavior is fine.
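To illustrate the const fn point: a table-driven checksum can be written entirely with const-compatible operations and evaluated at compile time, which a path built on `core::arch` intrinsics could not do at the time of this discussion. The sketch below assumes CRC-32/ISO-HDLC parameters, and its names are not the crate's API.

```rust
/// Reflected CRC-32/ISO-HDLC polynomial (assumed for this sketch).
const CRC32_POLY: u32 = 0xEDB8_8320;

/// Builds the 256-entry bytewise table in a const context.
const fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut crc = i as u32;
        let mut bit = 0;
        while bit < 8 {
            crc = if crc & 1 == 1 { (crc >> 1) ^ CRC32_POLY } else { crc >> 1 };
            bit += 1;
        }
        table[i] = crc;
        i += 1;
    }
    table
}

const TABLE: [u32; 256] = make_table();

/// Bytewise checksum usable both at run time and in const contexts.
const fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = !0u32;
    let mut i = 0;
    while i < bytes.len() {
        crc = (crc >> 8) ^ TABLE[((crc ^ bytes[i] as u32) & 0xFF) as usize];
        i += 1;
    }
    !crc
}

/// Evaluated by the compiler; no checksum code runs at program start.
const CHECK: u32 = crc32(b"123456789");
```

Making a SIMD implementation the default would take away the ability to compute constants like `CHECK` at compile time, which is the behavior the thread decides to leave untouched.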

@akhilles
Collaborator

akhilles commented Apr 4, 2024

> Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

I think we should gate the Simd impl behind the platform/target-features and leave the default as is.

Once runtime detection is added, this can be revisited.

@valaphee
Author

valaphee commented Apr 5, 2024

Yep, Simd is probably easier to understand.

Open for future PRs would be:

can all be done in a non-breaking way.

…ation and therefore uses the alignment of the destination
@valaphee valaphee requested a review from akhilles April 15, 2024 15:31