WIP: special case bitwise ops when buffers are u64 aligned #8807
base: main
Conversation
```rust
    && left_suffix.is_empty()
    && right_suffix.is_empty()
{
    let result_u64s = left_u64s
```
I am pretty excited to see how much this actually helps with performance. This code should vectorize pretty spectacularly
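For context, here is a minimal sketch of the kind of word-at-a-time fast path under discussion (my own illustration with hypothetical names, not the PR's exact code): `align_to` reinterprets the bit-packed bytes as `&[u64]`, and the fast path applies only when both views have empty prefixes and suffixes:

```rust
// A minimal sketch, not the PR's exact code: apply `op` a word at a
// time when both bit-packed buffers reinterpret cleanly as u64 slices.
fn bitwise_bin_op_u64<F>(left: &[u8], right: &[u8], op: F) -> Option<Vec<u64>>
where
    F: Fn(u64, u64) -> u64,
{
    // SAFETY: u64 has no invalid bit patterns, so viewing aligned bytes
    // as u64 words is sound.
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };

    if left_prefix.is_empty()
        && right_prefix.is_empty()
        && left_suffix.is_empty()
        && right_suffix.is_empty()
    {
        // A tight loop over u64 words: exactly the shape that
        // auto-vectorizes well.
        Some(left_u64s.iter().zip(right_u64s).map(|(&l, &r)| op(l, r)).collect())
    } else {
        None // fall back to the existing chunked/byte-wise path
    }
}
```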
TLDR: 30-50% faster 😎
🤖: Benchmark completed
30% - 50% faster 😎
It is strange that

Since I think no one really calls the bitwise kernels directly, I would be inclined to consolidate the benchmarks as part of #8806
```rust
#[allow(clippy::cast_ptr_alignment)]
let raw_data = self.buffer.as_ptr() as *const u64;

// bit-packed buffers are stored starting with the least-significant byte first
```
Wouldn't your change of using align_to fail to work on all bit alignments? (to_le)
As currently written I think this PR will work on both big and little endian -- as it only applies to buffers which are perfectly aligned and a multiple of a 64-bit boundary (aka there are no byte-wise operations occurring)
I think the endianness will come into play if we try to expand this technique to work on data that is not an exact multiple of 64 bits (aka the loop ends would likely be different)
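To make that concrete, a tiny property check of my own (not code from the PR): AND/OR/XOR act on each byte independently, so the result's bytes are the same whichever endianness the machine used to read the words:

```rust
fn main() {
    let a: u64 = 0x0123_4567_89ab_cdef;
    let b: u64 = 0xfedc_ba98_7654_3210;

    // A pure bitwise op commutes with any byte permutation, so ANDing
    // whole words and ANDing individual bytes give identical byte images
    // on both big- and little-endian machines.
    let word_and = (a & b).to_ne_bytes();
    let byte_and: Vec<u8> = a
        .to_ne_bytes()
        .iter()
        .zip(b.to_ne_bytes())
        .map(|(&x, y)| x & y)
        .collect();
    assert_eq!(word_and.to_vec(), byte_and);
}
```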
But the user expects the bits to be in a specific order when the callback is called, and this violates that, no?
I don't think there is any problem. But I clearly don't really understand the concern
consider the following (inefficient) code:

```rust
let mut index = 0;
bitwise_bin_op_helper(
    /* ...some args that are u64 aligned... */
    |left, right| {
        // peek at each bit of the operands, relying on their in-word order
        for bit in 0..64 {
            some_other_array[index] = (left >> bit) & (right >> bit) & 1;
            index += 1;
        }
        left & right
    },
)
```

Before and after your change, this will give a different order of bits on certain endiannesses.
Ah, I see
yes I agree that some operation that moves the bits around in the u64 word could give different results on different endiannesses
I think that is the case for the current bitwise operations too (they are supposed to be bitwise, not bit shuffling) 🤔
I will try and make a PR to update the docs to make this clearer
From a glance it does seem that all current uses of bitwise_bin_op_helper/bitwise_unary_op_helper would be safe even if the endianness was swapped.
It is a difference from the bit chunks implementation so I agree that updating the docs would be good.
An example kernel that would fail would be finding the position of the first unset bit in an array, although I'm not sure bitwise_unary_op_helper would be a great solution for that anyways since you'd want "break from loop once found" behavior.
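For illustration, a sketch of such a position-dependent kernel (my own example; `first_unset_bit` is not an arrow-rs function), written byte-wise; a u64-at-a-time version of this is exactly where a to_le-style conversion would become necessary on big-endian machines:

```rust
/// Find the index of the first 0 bit in a little-endian bit-packed
/// buffer. Unlike pure AND/OR/XOR, the answer depends on where each
/// byte sits within the word, so a u64 fast path could not simply
/// reinterpret the bytes on a big-endian machine.
fn first_unset_bit(packed: &[u8]) -> Option<usize> {
    for (byte_idx, &byte) in packed.iter().enumerate() {
        if byte != 0xFF {
            // trailing_zeros of the inverted byte locates the first 0
            // bit, counting from the least-significant bit.
            return Some(byte_idx * 8 + (!byte).trailing_zeros() as usize);
        }
    }
    None
}
```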
westonpace left a comment
Is this safe on a big-endian machine? I see the current bit_chunks has a to_le call; do you need something equivalent?
```rust
if left_prefix.is_empty()
    && right_prefix.is_empty()
    && left_suffix.is_empty()
    && right_suffix.is_empty()
```
Even if there is a prefix/suffix, could you do u64 operations on the aligned portion and fall back to bitwise operations on the unaligned portion?
That being said, this seems like a fine optimization on its own.
That is a good point (as long as the prefix and suffixes are the same length)
Ah, I see the other comment, my bad
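For the record, a sketch of what that mixed path could look like (my own sketch; `bitwise_and_mixed` is a hypothetical name, not the PR's code); as noted above, it only works when both buffers misalign by the same number of bytes:

```rust
/// Sketch: run the op byte-wise on the unaligned edges and word-wise on
/// the aligned middle. Only valid when both buffers split identically,
/// i.e. their prefixes (and hence suffixes) have the same length.
fn bitwise_and_mixed(left: &[u8], right: &[u8], out: &mut Vec<u8>) {
    assert_eq!(left.len(), right.len());
    // SAFETY: u64 has no invalid bit patterns.
    let (lp, lm, ls) = unsafe { left.align_to::<u64>() };
    let (rp, rm, rs) = unsafe { right.align_to::<u64>() };
    assert_eq!(lp.len(), rp.len(), "buffers must misalign identically");

    // Unaligned prefix: plain byte-wise fallback.
    out.extend(lp.iter().zip(rp).map(|(&l, &r)| l & r));
    // Aligned middle: word-at-a-time, the vectorizable fast path.
    for (&l, &r) in lm.iter().zip(rm) {
        out.extend_from_slice(&(l & r).to_ne_bytes());
    }
    // Unaligned suffix: byte-wise fallback again.
    out.extend(ls.iter().zip(rs).map(|(&l, &r)| l & r));
}
```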
Which issue does this PR close?
apply_unary_op and apply_binary_op bitwise operations #8619

Rationale for this change
While messing around with other bitwise operations, I am pretty sure we can optimize these operations more
Let's try using aligned u64 operations when possible
What changes are included in this PR?
Special case bitwise operations when the data is already aligned to u64 (a reasonably common special case)
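As a usage sketch (assuming arrow-buffer's `bitwise_bin_op_helper` keeps its current public signature and export path, which I have not re-verified here), callers see the same API and results; buffers with zero bit offsets and a length that is a multiple of 64 bits simply take the faster path:

```rust
use arrow_buffer::{buffer::bitwise_bin_op_helper, Buffer};

fn main() {
    // 128 bytes = 1024 bits per buffer: zero bit offsets and a length
    // that is a multiple of 64, so (given Arrow's aligned allocations)
    // this should qualify for the u64 fast path.
    let left = Buffer::from(vec![0b1010_1010u8; 128]);
    let right = Buffer::from(vec![0b1100_1100u8; 128]);
    let anded = bitwise_bin_op_helper(&left, 0, &right, 0, 1024, |l, r| l & r);
    assert_eq!(anded.as_slice()[0], 0b1000_1000);
}
```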
Are these changes tested?
Yes, by CI
Are there any user-facing changes?
No, just faster performance