-
Notifications
You must be signed in to change notification settings - Fork 835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure sum
for better auto-vectorization for floats
#4560
Conversation
Interesting, I can reproduce the benchmark results on my machine (i9-11900KB with AVX-512), the version without simd feature is even slightly faster. Very nice improvement! With What cpu did you run your benchmarks on? |
Great! Do you mean that both the non-null and null version are competitive with the simd feature on your machine?
It's an i7-10750H. I used the following rustc: |
These are my results, simd feature is a tiny bit ahead, but maybe not enough to justify the additional code complexity.
|
I'll let you/others decide if we should replace the simd feature impl with the code in this PR. And if so, should this be done in this PR, or another one? I'll push a commit today it tomorrow that resolves the todo, picking a proper value for LANES based on T. |
I tried expanding the benchmarks in f472f3f, and then comparing before this PR (but with f472f3f) and this PR:
This is still on an i7-10750H, and Interestingly, the speedups are only for f32 with/without nulls, and ts_millis with nulls. For all others, it's not a speedup. In any case, some more investigation into the regressions are needed before this can be merged. |
Marking as draft whilst we work out the details. Feel free to mark ready for review when you would like me to take another look. FWIW when I was playing with this a few days ago in Godbolt, I found that nightly did a better job optimising the code than stable, and this was borne out by benchmarks. We may be in the unfortunate territory of LLVM being tempremental |
sum
for better auto-vectorizationsum
for better auto-vectorization for floats
I couldn't find a single implementation that would speed up both integer and floating point numbers, so I decided to have an implementation for both. Marked ready to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, thank you for sticking with this.
I'm not sure what your area of focus is, but FWIW following the recent improvements to grouping in DataFusion, it no longer actually uses these kernels as it now performs aggregation of all groups at once
| DataType::Decimal128(_, _) | ||
| DataType::Decimal256(_, _) => match T::lanes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is decimal here?
sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.0, "ts_millis no nulls"); | ||
sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.5, "ts_millis 50% nulls"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW arithmetic on timestamps as this does is not especially meaningful, adding two timestamps doesn't yield a timestamp, DurationMillisecondType
might be more meaningful
pub trait ArrowNumericType: ArrowPrimitiveType {} | ||
pub trait ArrowNumericType: ArrowPrimitiveType { | ||
/// The number of SIMD lanes available | ||
fn lanes() -> usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a little off to define this for all the types, but then only use it for a special case of floats 🤔
| DataType::Decimal256(_, _) => match T::lanes() { | ||
1 => sum_impl_floating::<T, 1>(array), | ||
2 => sum_impl_floating::<T, 2>(array), | ||
4 => sum_impl_floating::<T, 4>(array), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me that we have 3 floating point types, we could just dispatch to sum_impl_floating with the appropriate constant specified, without needing ArrowNumericType?
@@ -285,44 +285,178 @@ where | |||
return None; | |||
} | |||
|
|||
let data: &[T::Native] = array.values(); | |||
fn sum_impl_integer<T>(array: &PrimitiveArray<T>) -> Option<T::Native> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW if you changed the signature to
fn sum_impl_integer<T: ArrowNativeType>(values: &[T], nulls: Option<&NullBuffer>) -> Option<T>
It would potentially save on codegen, as it would be instantiated per native type not per primitive type
sum = sum.add_wrapping(*value); | ||
} | ||
|
||
fn sum_impl_floating<T, const LANES: usize>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
} | ||
} | ||
|
||
match T::DATA_TYPE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This match block is kind of grim, but I don't have a better solution off the top of my head... Perhaps some sort of trait 🤔
Marking this as a draft to make clear it isn't awaiting review, feel free to unmark when you would like me to take another look |
This code has been incorporated into #5100 and merged, thank you for starting this process |
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Restructure the code for
sum
to allow for better auto-vectorization.The code in the
simd
module is very similar, but it depends onpacked_simd
.The auto-vectorized non-null case now has the same performance the
simd
feature impl. See benchmarks.I didn't manage to make the null case quite as fast as the
simd
feature impl, but it's pretty close.If I/someone else manages to also make the null case identical, I think we can remove the
simd
version altogether to remove duplicated code.Benchmarks:
Before:
After:
Are there any user-facing changes?
Faster implementation of
sum
.Non-null case is more than 10x faster.
Null case is around 3x faster.