-
Notifications
You must be signed in to change notification settings - Fork 1.8k
perf: improve performance of vectorized_equal_to for PrimitiveGroupValueBuilder in multi group by aggregation
#17977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: improve performance of vectorized_equal_to for PrimitiveGroupValueBuilder in multi group by aggregation
#17977
Conversation
| let iter = izip!( | ||
| lhs_rows.iter(), | ||
| rhs_rows.iter(), | ||
| equal_to_results.iter_mut(), | ||
| ); | ||
|
|
||
| for (&lhs_row, &rhs_row, equal_to_result) in iter { | ||
| // Has found not equal to in previous column, don't need to check | ||
| if !*equal_to_result { | ||
| continue; | ||
| } | ||
|
|
||
| // Perf: skip null check (by short circuit) if input is not nullable | ||
| let exist_null = self.nulls.is_null(lhs_row); | ||
| let input_null = array.is_null(rhs_row); | ||
| if let Some(result) = nulls_equal_to(exist_null, input_null) { | ||
| *equal_to_result = result; | ||
| continue; | ||
| } | ||
|
|
||
| // Otherwise, we need to check their values | ||
| *equal_to_result = self.group_values[lhs_row].is_eq(array.value(rhs_row)); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved the code from vectorized_equal_to and removed the if NULLABLE as we will always get here if nullable
|
@alamb can you please run
|
| self.group_values[lhs_row] | ||
| } else { | ||
| // SAFETY: indices are guaranteed to be in bounds | ||
| unsafe { *self.group_values.get_unchecked(lhs_row) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As lhs_row is not checked here te be in bounds, this method would need to be marked unsafe as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean?
|
I could run the benchmarks locally so that we can see the performance gains. Can you provide two git hashes at which to run the criterion benchmark and compare? |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
The results look promising and consistent -- thank you @rluvaton -- I plan to review this PR over the next few days |
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton -- this PR looks good to me
| unsafe { *array_values.get_unchecked(rhs_row) } | ||
| }; | ||
|
|
||
| // Always evaluate, to allow for auto-vectorization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes sense for primitive values -- namely that the cost of checking if we should compare dominated just always comparing
This is a good idea. I filed a ticket to track it |
|
🤖: Benchmark completed Details
|
|
As suspected up to 6 times faster:
all below are for The good thing you will notice below is that regardless of the number of true count the results are consistent - meaning it really not using branches and also, the null case which I did not optimize is roughly the same Table
|
|
once we have mutable bit packed buffer, we will also evaluate the nulls separately and do bit mask operation on the compare result to make that case faster as well |
|
Thanks again @rluvaton |
…pValueBuilder` in multi group by aggregation (apache#17977) ## Which issue does this PR close? N/A ## Rationale for this change Making multi column aggregation even faster ## What changes are included in this PR? In `PrimitiveGroupValueBuilder.vectorized_equal_to` always evaluate and use unchecked as both of these changes are what making the code compile to SIMD. ## Are these changes tested? Existing tests ## Are there any user-facing changes? Nope ----- I tried a LOT of variations [GodBolt](https://godbolt.org/z/Kc8ze6E9n) from splitting to fixed size chunks and trying to get auto-vectorization to use gather and creating bitmask to even testing portable SIMD (just to see what it will generate). this version only optimize the non null path for the moment as it is the easiest. once and if we change from `&mut [bool]` to mutable packed bits we could: 1. evaluate in chunks of `64` items (I tried different variations to see what is the best - you can tweak in the godbolt above with different type and size to check for yourself), 64 is not necessarily the best but it will be the fastest I think for doing AND with the `equal_to_results` boolean buffer 2. add optimization for nullable as well by just doing bitwise operation at 64 items at a time and avoid the cost of getting each bit manually 3. skip 64 items right away if the the `equal_to_results` equal to `0x00` (i.e. all false) --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
N/A
Rationale for this change
Making multi column aggregation even faster
What changes are included in this PR?
In
PrimitiveGroupValueBuilder.vectorized_equal_toalways evaluate and use unchecked as both of these changes are what making the code compile to SIMD.Are these changes tested?
Existing tests
Are there any user-facing changes?
Nope
I tried a LOT of variations GodBolt
from splitting to fixed size chunks and trying to get auto-vectorization to use gather and creating bitmask to even testing portable SIMD (just to see what it will generate).
this version only optimize the non null path for the moment as it is the easiest.
once and if we change from
&mut [bool]to mutable packed bits we could:64items (I tried different variations to see what is the best - you can tweak in the godbolt above with different type and size to check for yourself), 64 is not necessarily the best but it will be the fastest I think for doing AND with theequal_to_resultsboolean bufferequal_to_resultsequal to0x00(i.e. all false)