In one of our benchmarks the `concat` kernel was identified as a major performance bottleneck while sorting, specifically the closures inside `build_extend_null_bits`, which call `set_bits`. The logic there currently sets individual bits and also contains a branch for every bit:
```rust
if bit_util::get_bit(...) {
    bit_util::set_bit(...);
}
```
I think it should be possible to rewrite this to set multiple bits at a time and remove most of the branching overhead. The general idea (sketched after the list) would be:
1. Append individual bits until the destination write position is byte aligned.
2. Start a `BitChunk` iterator on the source buffer and then append a `u8` or `u64` at a time.
3. Append the remainder a `u8` at a time.
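A minimal sketch of that shape, assuming LSB-first bit order (as Arrow uses) and a zero-initialized destination buffer. All names here (`read_u8`, the offset parameters, etc.) are illustrative, not the arrow-rs API:

```rust
/// Illustrative only: LSB-first bit helpers, computing the mask directly.
fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i / 8] & (1u8 << (i % 8))) != 0
}

fn set_bit(data: &mut [u8], i: usize) {
    data[i / 8] |= 1u8 << (i % 8);
}

/// Read 8 bits starting at an arbitrary bit offset as one byte.
fn read_u8(src: &[u8], bit_offset: usize) -> u8 {
    let byte = bit_offset / 8;
    let shift = bit_offset % 8;
    if shift == 0 {
        src[byte]
    } else {
        // Stitch the result together from two adjacent source bytes.
        (src[byte] >> shift) | (src[byte + 1] << (8 - shift))
    }
}

/// Copy `len` bits from `src` (starting at bit `src_offset`) into `dst`
/// (starting at bit `dst_offset`), assuming the destination bits are
/// initially zero, as they are for a freshly allocated null buffer.
fn set_bits(dst: &mut [u8], src: &[u8], dst_offset: usize, src_offset: usize, len: usize) {
    let mut i = 0;
    // 1. Append individual bits until the destination is byte aligned.
    while i < len && (dst_offset + i) % 8 != 0 {
        if get_bit(src, src_offset + i) {
            set_bit(dst, dst_offset + i);
        }
        i += 1;
    }
    // 2. Append a byte at a time: one unconditional store per 8 bits,
    //    no per-bit branch. A u64 variant would widen this loop.
    while len - i >= 8 {
        dst[(dst_offset + i) / 8] |= read_u8(src, src_offset + i);
        i += 8;
    }
    // 3. Append the remaining < 8 bits individually.
    while i < len {
        if get_bit(src, src_offset + i) {
            set_bit(dst, dst_offset + i);
        }
        i += 1;
    }
}
```

The middle loop is where the win would come from: destination writes are byte aligned, so there is one store per 8 bits and no per-bit branch; iterating the source with `BitChunk` would extend the same structure to `u64` chunks.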
Similar logic would apply when setting all bits to valid, appending chunks of `u8::MAX` or `u64::MAX` at a time.
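For the all-valid case, a sketch under the same assumptions (zero-initialized destination, hypothetical names) could look like:

```rust
/// Mark `len` bits starting at bit `offset` as valid (set to 1).
fn set_all_valid(dst: &mut [u8], offset: usize, len: usize) {
    let mut i = 0;
    // Leading bits until the write position is byte aligned.
    while i < len && (offset + i) % 8 != 0 {
        dst[(offset + i) / 8] |= 1u8 << ((offset + i) % 8);
        i += 1;
    }
    // Whole bytes: append u8::MAX at a time (or u64::MAX eight bytes
    // at a time in a widened variant).
    while len - i >= 8 {
        dst[(offset + i) / 8] = u8::MAX;
        i += 8;
    }
    // Trailing bits.
    while i < len {
        dst[(offset + i) / 8] |= 1u8 << ((offset + i) % 8);
        i += 1;
    }
}
```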
The `get_bit` / `set_bit` functions themselves could probably also be sped up a little; I think that on modern processors computing the bit masks instead of using a lookup table should be faster. After the above changes, though, those functions would no longer be on the hot path.
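To illustrate that last point, the two variants side by side (the mask table below is a guess at the shape of the current code, not a quote of the arrow-rs source):

```rust
// Table-based variant, roughly the shape described above.
static BIT_MASK: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];

fn get_bit_lookup(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & BIT_MASK[i & 7]) != 0
}

// Shift-based variant: `1 << (i & 7)` compiles to a single shift,
// avoiding the table load entirely.
fn get_bit_shift(data: &[u8], i: usize) -> bool {
    (data[i >> 3] & (1u8 << (i & 7))) != 0
}
```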