Add SSE2/AVX2/WASM SIMD support #86
Conversation
Note: I think this currently breaks serde support.
This reverts commit 5282a4d.
Reran the benchmarks; this seems to be a significant gain on multiple fronts when the right CPU features are enabled during compilation:
The `Mask` changes might need to be reverted though. The regressions in …
```rust
// SAFETY: This is using the exact same allocation pattern, size, and capacity
// making this reconstruction of the Vec safe.
let mut data = unsafe {
    let mut data = ManuallyDrop::new(self.data);
    let ptr = data.as_mut_ptr().cast();
    let len = data.len() * SimdBlock::USIZE_COUNT;
    let capacity = data.capacity() * SimdBlock::USIZE_COUNT;
    Vec::from_raw_parts(ptr, len, capacity)
};
```
I don't think this is safe. The `Vec` is initially allocated with `SimdBlock`s, which can have a different alignment than `usize`s. For example the one in `avx2` has an alignment of 32 and the one in `sse2` has an alignment of 16, while `usize` only has an alignment of 8.
Fixes #73.

Changes `Block` from an alias to a platform/target-specific newtype around `usize`, `__m128i`, `__m256i`, or `v128`. This supports all SIMD intrinsics that have been stabilized into the standard library.

SSE2 is universally available on all x86_64 machines, so this should see a 4x speedup relative to the u32-based approach originally used before #74. AVX2 is only on ~89% of consumer machines, so it may not be fully reliable, but should show another 2x speedup over SSE2. Those who are using this in a cloud or server environment will likely benefit from compiling with `-C target-cpu=native`, which should enable it on the target machine.

NOTE: This adds a lot of unsafe code, simply by nature of using SIMD intrinsics. There's a good chunk of `core::mem::transmute` going around too, though I try to keep it to a minimum.

Performance
Using a port of the benchmarks to Criterion (see #84): on set or batch operations like `insert_range`, `intersection_with`, etc., these SIMD-accelerated versions are, as expected, 2-4 times faster than when using `usize` as the block, which should extend the performance gains of #74 even further.

I tested optionally using runtime feature detection via `is_x86_feature_detected!` on these operations, and unfortunately that causes serious regressions.