Implement SIMD comparison operations for types with less than 4 lanes (i128) #1146
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #1146    +/- ##
==========================================
- Coverage   82.55%   82.55%   -0.01%
==========================================
  Files         169      169
  Lines       50456    50456
==========================================
- Hits        41655    41653       -2
- Misses       8801     8803       +2
Continue to review full report at Codecov.
LGTM.
Co-authored-by: Paddy Horan <5733408+paddyhoran@users.noreply.github.com>
@@ -2723,8 +2743,6 @@ mod tests {
         );
     }
-
-    // Fails when simd is enabled: https://github.com/apache/arrow-rs/issues/1136
👍
Thanks @jhorstmann, fyi @tustvold
Which issue does this PR close?
Implements comparison for simd types with less than 8 lanes.
Closes #1136.
What changes are included in this PR?
This PR changes the comparison kernel so that the simd portion can always append 64 bits at a time. Since the simd types are 512 bits wide, the inner comparison is unrolled: for example 8 times for Float64 (8 x 8 lanes) or 4 times for Float32 (4 x 16 lanes). For Int8 types it does not get unrolled, since a single comparison already yields 64 bits.
This should even speed up the comparison kernel a bit for common types, because there is less loop overhead.
On my laptop the simd version for `i128` / `MonthDayNano` types is not actually faster than the scalar version; on a more modern or server-class machine there should be a slight speedup.

Unrelated to this change, I also noticed that the code generation for non-avx512 machines is sub-optimal, since the compiler has to emulate the 512-bit wide operations using smaller vector registers, and for the bitmap-generating code this has some overhead.
Are there any user-facing changes?