-
Notifications
You must be signed in to change notification settings - Fork 1k
Perf: Make sort string view fast(1.5X ~ 3X faster) #7792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The result for the PR for sort string view is crazy faster, 1.5X ~ 3X critcmp fast_sort_string_view main --filter "sort string_view"
group fast_sort_string_view main
----- --------------------- ----
sort string_view[0-400] nulls to indices 2^12 1.00 32.3±0.33µs ? ?/sec 1.94 62.7±2.27µs ? ?/sec
sort string_view[0-400] to indices 2^12 1.00 55.9±0.48µs ? ?/sec 2.17 121.5±2.48µs ? ?/sec
sort string_view[10] nulls to indices 2^12 1.00 36.8±0.39µs ? ?/sec 1.56 57.4±1.19µs ? ?/sec
sort string_view[10] to indices 2^12 1.00 62.8±0.79µs ? ?/sec 1.69 106.3±1.14µs ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12 1.00 35.2±0.56µs ? ?/sec 2.25 79.3±3.75µs ? ?/sec
sort string_view_inlined[0-12] to indices 2^12 1.00 59.7±0.60µs ? ?/sec 2.95 176.5±14.03µs ? ?/sec |
arrow-ord/src/sort.rs
Outdated
let inline_u128 = u128::from_le_bytes(inline_bytes).to_be(); | ||
|
||
// Shift right by 32 bits to discard the zero padding (upper 4 bytes), | ||
// so that the inline string occupies the high 96 bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe makes sense to use a mask here instead of right / left shifting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Dandandan for review, it try to use mask replacing the right / left shifting, but the performance no change, it believe the optimizer do some optimize already:
sort string_view[10] to indices 2^12
time: [62.344 µs 62.458 µs 62.603 µs]
change: [−0.7065% −0.3295% +0.0918%] (p = 0.11 > 0.05)
No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe
sort string_view[10] nulls to indices 2^12
time: [36.626 µs 36.673 µs 36.719 µs]
change: [−0.4095% −0.1715% +0.0813%] (p = 0.17 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
3 (3.00%) high mild
sort string_view[0-400] to indices 2^12
time: [55.812 µs 55.875 µs 55.939 µs]
change: [−0.1893% +0.0245% +0.2323%] (p = 0.82 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
sort string_view[0-400] nulls to indices 2^12
time: [32.269 µs 32.310 µs 32.351 µs]
change: [−0.1251% +0.1156% +0.3441%] (p = 0.34 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
sort string_view_inlined[0-12] to indices 2^12
time: [59.737 µs 59.853 µs 60.040 µs]
change: [+0.1045% +0.3772% +0.6674%] (p = 0.01 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
sort string_view_inlined[0-12] nulls to indices 2^12
time: [35.176 µs 35.237 µs 35.299 µs]
change: [−0.0949% +0.2861% +0.6023%] (p = 0.12 > 0.05)
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the applied patch for testing:
diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 8796907669..2b0518c704 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -320,27 +320,37 @@ fn sort_bytes<T: ByteArrayType>(
/// and the length in the lower 32 bits (big-endian).
#[inline(always)]
pub fn inline_key_fast(raw: u128) -> u128 {
- // Convert the raw u128 (little-endian) into bytes for manipulation
+ // Interpret the 128-bit input as little-endian bytes
let raw_bytes = raw.to_le_bytes();
- // Extract the length (first 4 bytes), convert to big-endian u32, and promote to u128
+ // --- Step 1: Extract and convert the length field --- //
+ // The first 4 bytes (little-endian) encode the string length.
+ // Read them as u32 in little-endian order, then convert to big-endian
+ // and promote to u128 so it can occupy the low 32 bits of the key.
let len_le = &raw_bytes[0..4];
- let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as u128;
+ let len_be = u32::from_le_bytes(len_le.try_into().unwrap())
+ .to_be() as u128;
- // Extract the inline string bytes (next 12 bytes), place them into the lower 12 bytes of a 16-byte array,
- // padding the upper 4 bytes with zero to form a little-endian u128 value
+ // --- Step 2: Extract the inlined string bytes --- //
+ // Bytes 4..16 contain up to 12 bytes of the inline string.
+ // Copy them into bytes 4..16 of a 16-byte buffer, leaving top 4 bytes zero.
let mut inline_bytes = [0u8; 16];
inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]);
- // Convert to big-endian to ensure correct lexical ordering
- let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
+ // Convert that buffer to a big-endian u128 so that
+ // byte-wise lexical order corresponds to numeric order.
+ let inline_be = u128::from_le_bytes(inline_bytes).to_be();
- // Shift right by 32 bits to discard the zero padding (upper 4 bytes),
- // so that the inline string occupies the high 96 bits
- let inline_part = inline_u128 >> 32;
+ // --- Step 3: Mask out the padding and isolate the inline portion --- //
+ // Define a mask that selects bits 32..128 (the 12 inlined bytes in BE)
+ // and zeroes out the high and low 32 bits.
+ const INLINE_MASK: u128 = 0x00FF_FFFF_FFFF_FFFF_FFFF_FFFF_0000_0000u128;
+ let inline_part = inline_be & INLINE_MASK;
- // Combine the inline string part (high 96 bits) and length (low 32 bits) into the final key
- (inline_part << 32) | len_be
+ // --- Step 4: Combine masked inline portion with length --- //
+ // The inline bytes already occupy bits 32..128 after masking.
+ // OR in the big-endian length (bits 0..32) to form the final key.
+ inline_part | len_be
}
fn sort_byte_view<T: ByteViewType>(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the code copy from this PR, until each one merged, i can reuse the code:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I missed that it was reused, let's merge that one first then and reuse it here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuqi-lucas and @Dandandan -- this looks pretty cool. I think I must be missing something about how this u128 is constructed -- I left some comments. Let me know what you think
} | ||
|
||
// 3.2 Compare 4-byte prefix in big-endian order | ||
let pref_a = ByteView::from(raw_a).prefix.swap_bytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ByteView can only be used if the view len is greater than 12:
arrow-rs/arrow-data/src/byte_view.rs
Lines 29 to 30 in 10d9714
/// Helper to access views of [`GenericByteViewArray`] (`StringViewArray` and | |
/// `BinaryViewArray`) where the length is greater than 12 bytes. |
Isn't this potentially comparing a inline view and a non inline view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for review, this actually copy from another PR:
#7748
And i add the comments to it here, i also can change to the raw convert instead of using ByteView::from, but i think it's safe that we only using the prefix here, and inline/unlined all have the prefix.
May be we can make another PR ready to merge, then i can just copy the same code from there, thanks a lot!
return pref_a.cmp(&pref_b); | ||
} | ||
|
||
// 3.3 Fallback to full byte-slice comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this only valid if both views are not inlined (len > 12)? I am not sure it is ok to lexographically compare an inlined view and a non inline view 🤔 Won't that potentially compare the buffer offset to part of the strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I double checked and value_unchecked
correctly handles short/long views
Though it makes me wonder if we could squeeze more out of this function by handling the three cases explicitly (short, long), (long, short) and (long, long)
However, that might have a minimal payback
arrow-ord/src/sort.rs
Outdated
/// and packs them into a single u128 value suitable for fast comparisons. | ||
/// | ||
/// # Note | ||
/// The input `raw` is assumed to be in little-endian format with the following layout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is some potentially useful ascii art:
StringView Format
┌───────────────────────────────────────────┬──────────────┐
│ data │ length │
Strings, len <= 12 │ (padded with \0) │ (u32) │
(Inline) │ │ │
└───────────────────────────────────────────┴──────────────┘
127 31 0 bit
offset
┌──────────────┬─────────────┬──────────────┬──────────────┐
│buffer offset │ buffer index│ data prefix │ length │
Strings, len > 12 │ (u32) │ (u32) │ (4 bytes) │ (u32) │
(Offset) │ │ │ │ │
└──────────────┴─────────────┴──────────────┴──────────────┘
127 95 63 31 0 bit
offset
And the raw monodraw file:
stringview.zip
arrow-ord/src/sort.rs
Outdated
sort_impl(options, &mut valids, &nulls, limit, Ord::cmp).into() | ||
} | ||
|
||
/// Builds a 128-bit composite key for an inline value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment does a good job explaining what this function does but I think it would be good if we could explain the why it is useful a bit more specifically than "fast comparisons". Spcifically what property(s) the returned u128
has
Something like this
/// Builds a 128-bit composite key for an inline value: | |
/// Builds a 128-bit composite key for an inline value for fast sorting | |
/// | |
/// The `u128` returned by this function compares the same comparing | |
/// the `str` value from an inlined view. |
I am not quite sure is that is correct because if it is, I don't understand why it is copying the length bytes as well 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for review, good suggestion!
And the idea is coming from the comments here:
May be we can get the based PR ready first, thanks a lot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sorry -- I missed the other one -- let's work on #7748 first
arrow-ord/src/sort.rs
Outdated
/// The output u128 key places the inline string data in the upper 96 bits (big-endian) | ||
/// and the length in the lower 32 bits (big-endian). | ||
#[inline(always)] | ||
pub fn inline_key_fast(raw: u128) -> u128 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before merging this PR, I think we should write some unit tests showing that the output u128 of this function do indeed compare the same as the corresponding &str
representations (assuming I am not missing something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now you did this in #7748 -- I will continue the conversation there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for review, now the PR has been updated based the merged PR which contains the new API for fast path. |
🤖 |
🤖: Benchmark completed Details
|
Those are some impressive numbers.
I am now reviewing this PR in more detail |
This is amazing - super exciting! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- I went over this PR again carefully and I think it is really nice and quite clever
FYI @Dandandan and @XiangpengHao 🚀
return pref_a.cmp(&pref_b); | ||
} | ||
|
||
// 3.3 Fallback to full byte-slice comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I double checked and value_unchecked
correctly handles short/long views
Though it makes me wonder if we could squeeze more out of this function by handling the three cases explicitly (short, long), (long, short) and (long, long)
However, that might have a minimal payback
This is a great description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🚀 🚀
CRAZY STUFF!!! |
Very impressive @zhuqi-lucas you are taking it this far. |
Thank you @alamb @Dandandan @adriangb for review! |
🤖 |
🤖 |
🤖: Benchmark completed Details
|
Which issue does this PR close?
This is a follow-up for #7748
In theory we can custom string view compare, and make it crazy faster.
Rationale for this change
In theory we can custom string view compare, and make it crazy faster.
What changes are included in this PR?
In theory we can custom string view compare, and make it crazy faster.
Are these changes tested?
Yes
Are there any user-facing changes?
No