[WIP] [Rust] Add explicit SIMD vectorization for arithmetic ops in "array_ops" #3451
Conversation
@paddyhoran This PR includes some commits in the diff view that should not show up. Can you rebase on master?
Yep, sorry about that. I'll rebase when I get a chance.
@paddyhoran This is looking good. I don't have experience with SIMD yet, but it was on my list to learn, so this seems like a good opportunity. I will start testing this. I like the
Thanks for the work @paddyhoran! I'll take a look at this too.
Hold off for a little; I'm trying to clean it up tonight. I'll post what I have to get your opinions.
I don't have any SIMD experience; I do, however, 👍 the
Thanks @paddyhoran. I also think packed_simd is a good choice for now, for this purpose. Left a few comments.
rust/arrow/src/array.rs
let raw = unsafe { std::slice::from_raw_parts(self.raw_values(), self.len()) };
&raw[offset..offset + len]
let raw =
    unsafe { std::slice::from_raw_parts(self.raw_values().offset(offset as isize), len) };
Why this change?
The comment says that it does not do bounds checking but I found that it did.
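For illustration only (this is not the Arrow code, just a minimal sketch of the difference being discussed): indexing into a slice is bounds-checked, while building a slice directly from a raw pointer at the offset is not.

// Hypothetical helpers, only to illustrate the bounds-checking difference.
fn checked_view(values: &[i32], offset: usize, len: usize) -> &[i32] {
    // Panics if offset + len exceeds values.len().
    &values[offset..offset + len]
}

// No bounds check: the caller must guarantee offset + len stays inside the allocation.
unsafe fn unchecked_view(values: &[i32], offset: usize, len: usize) -> &[i32] {
    std::slice::from_raw_parts(values.as_ptr().add(offset), len)
}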
+ Div<Output = T::Native>
+ Zero,
{
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Can we make this runtime detection? e.g.:
if is_x86_feature_detected!("avx2") {
    return add_simd(&left, &right);
} else {
    math_op(left, right, |a, b| Ok(a + b))
}
See below.
rust/arrow/src/datatypes.rs
/// Performs a SIMD add operation
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn add(left: Self::Simd, right: Self::Simd) -> Self::Simd;
Can we make this general to all math operations, e.g. +, -, *, /? They seem to be supported by the SIMD type.
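For illustration, one way the code could stay generic over the operation (the names here are hypothetical, not from the PR): take the SIMD operation as a closure so +, -, *, and / share a single code path.

use packed_simd::f32x8;

// Hypothetical helper: apply any binary SIMD operation passed in as a closure.
fn simd_bin_op<F>(left: f32x8, right: f32x8, op: F) -> f32x8
where
    F: Fn(f32x8, f32x8) -> f32x8,
{
    op(left, right)
}

fn example() {
    let l = f32x8::splat(1.0);
    let r = f32x8::splat(2.0);
    let sum = simd_bin_op(l, r, |a, b| a + b);
    let quotient = simd_bin_op(l, r, |a, b| a / b);
    let _ = (sum, quotient);
}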
for i in (0..left.len()).step_by(lanes) {
    let simd_left = T::load(left.value_slice(i, lanes));
    let simd_right = T::load(right.value_slice(i, lanes));
    let simd_result = T::add(simd_left, simd_right);
How are we going to handle nulls?
I still have to work on nulls.
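For context, one common approach (a sketch of an assumption, not the PR's implementation) is to compute the values unconditionally and derive the output validity by AND-ing the input null bitmaps, since a slot is valid only when both inputs are valid.

// Hypothetical helper: combine two validity bitmaps (1 bit per value, packed into bytes).
fn combine_null_bitmaps(left: Option<&[u8]>, right: Option<&[u8]>) -> Option<Vec<u8>> {
    match (left, right) {
        (None, None) => None,
        (Some(l), None) => Some(l.to_vec()),
        (None, Some(r)) => Some(r.to_vec()),
        (Some(l), Some(r)) => Some(l.iter().zip(r).map(|(a, b)| a & b).collect()),
    }
}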
@@ -17,6 +17,7 @@

pub mod array;
pub mod array_data;
pub mod compute;
We should not need this file anymore since we have lib.rs.
/// available.
pub trait ArrowNumericType: ArrowPrimitiveType {
    /// Defines the SIMD type that should be used for this numeric type
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
Can we put this above the trait definition? Also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.
It looked wrong to have #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] in multiple places; I plan to go back and see how to clean this up.
also, I'm not sure if we should define another trait just for SIMD, since now ArrowNumericType really is almost all about SIMD.
I don't quite understand what you mean here. What do you propose?
Oh, I'm proposing to perhaps add another trait such as ArrowSIMDType, just for the SIMD purpose. Let me know if this makes sense.
Right, that probably does make sense as eventually we will need SIMD ops over Boolean Arrays as well.
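To make the idea concrete, a rough sketch of what such a split could look like (the trait and method names are hypothetical, not from the PR): keep ArrowNumericType free of SIMD details and move them into a dedicated trait.

// Sketch only: a dedicated trait so SIMD concerns don't leak into ArrowNumericType.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub trait ArrowSIMDType: ArrowNumericType {
    /// The packed_simd vector type used for this Arrow type.
    type Simd;

    /// Number of lanes in Self::Simd.
    fn lanes() -> usize;

    /// Loads a slice of native values into a SIMD vector.
    fn load(slice: &[Self::Native]) -> Self::Simd;

    /// Performs a SIMD add.
    fn add(left: Self::Simd, right: Self::Simd) -> Self::Simd;
}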
rust/arrow/src/datatypes.rs
};
}

make_numeric_type!(Int8Type, i8, i8x64);
Have you considered AVX-512? Also wondering if it would be possible to support both depending on the architecture...
I actually thought that packed_simd only supported up to 256 bits; I need to update this.
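For what it's worth, a hedged sketch of supporting both widths at compile time (the type alias name is made up; packed_simd exposes both 256-bit and 512-bit vector types):

// Pick the widest vector type the build actually enables; fall back to 256-bit.
// The avx512f feature is checked at compile time (e.g. via -C target-feature).
#[cfg(target_feature = "avx512f")]
type Int8Simd = packed_simd::i8x64;

#[cfg(not(target_feature = "avx512f"))]
type Int8Simd = packed_simd::i8x32;

fn int8_lanes() -> usize {
    Int8Simd::lanes()
}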
I wanted to leave a note on the general direction. At first I wanted runtime detection of the "best" intrinsics. However, after some research there is no "best" solution; it depends on the situation. Here is what I'm planning: support the largest SIMD registers available in
I believe that SSE is available on all Intel CPUs and if you do not use I think we should add runtime detection via a feature flag in another JIRA. Runtime detection is not always worth using: even though you may have access to intrinsics with wider registers, the memory bandwidth of your CPU won't really allow you to take advantage of them; that's the situation I'm in on my dev machine. In this case you might end up checking for different intrinsics at runtime even though the most basic SSE version is just as fast.
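To sketch what the feature-gated follow-up could look like (the Cargo feature name and the function names below are hypothetical, not from the PR):

// Only pay for the runtime check when built with a hypothetical "simd_runtime_detect" feature.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn add_f32(left: &[f32], right: &[f32]) -> Vec<f32> {
    #[cfg(feature = "simd_runtime_detect")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just verified AVX2 is available on this CPU.
            return unsafe { add_f32_avx2(left, right) };
        }
    }
    // Scalar fallback; on memory-bound workloads this can match the SIMD path, as noted above.
    left.iter().zip(right).map(|(a, b)| a + b).collect()
}

#[cfg(all(
    feature = "simd_runtime_detect",
    any(target_arch = "x86", target_arch = "x86_64")
))]
#[target_feature(enable = "avx2")]
unsafe fn add_f32_avx2(left: &[f32], right: &[f32]) -> Vec<f32> {
    // With AVX2 enabled the compiler may vectorize this loop; an explicit
    // packed_simd or std::arch kernel would go here instead.
    left.iter().zip(right).map(|(a, b)| a + b).collect()
}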
Sure. I'm OK with adding the runtime detection later.
Thanks. This is interesting to know. Do you have any benchmark numbers, or any reference that explains this?
The best resource is the issue I opened in
# Conflicts:
#	rust/arrow/src/compute/array_ops.rs
Closing in favor of smaller incremental PRs.
@andygrove @sunchao this is nowhere near done but I did want to get your opinions on two items before I go further:
1. The choice of packed_simd (a minimal usage sketch follows after this list). There are other options out there, and some try to provide a higher-level API. However, I feel that ecosystem has not matured enough yet to the point that any single third-party library is the clear choice for SIMD in Rust, and therefore it is not worth picking one up as a dependency at this time. In Arrow we will always be working with packed vectors; couple this with the fact that the objective of packed_simd is to get stabilized in the future, and I think that packed_simd is a good choice. Alternatively, we could use the raw intrinsics in std::arch.
2. Re-organization of what is called array_ops into the compute sub-module. This is for two reasons. Although I will try to make the SIMD-optimized versions of the code as easy to use as possible (with run-time detection, etc.), the actual implementation of SIMD code tends to be a little verbose, as we will want to conditionally compile different versions for different CPUs. The C++ version is structured this way, with a compute sub-module.
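To give a flavor of the packed_simd choice, a minimal sketch of what an add kernel over f32 slices might look like (this is not the PR's code; equal-length, null-free inputs are assumed):

use packed_simd::f32x8;

// Process full 8-lane chunks with SIMD, then finish the remainder with scalar code.
fn add_f32(left: &[f32], right: &[f32], out: &mut [f32]) {
    assert_eq!(left.len(), right.len());
    assert_eq!(left.len(), out.len());

    let lanes = f32x8::lanes();
    let chunked = left.len() / lanes * lanes;

    for i in (0..chunked).step_by(lanes) {
        let l = f32x8::from_slice_unaligned(&left[i..i + lanes]);
        let r = f32x8::from_slice_unaligned(&right[i..i + lanes]);
        (l + r).write_to_slice_unaligned(&mut out[i..i + lanes]);
    }
    for i in chunked..left.len() {
        out[i] = left[i] + right[i];
    }
}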