Perf: Make sort string view fast(1.5X ~ 3X faster) #7792

zhuqi-lucas · 2025-06-26T13:57:51Z

Which issue does this PR close?

This is a follow-up for #7748

In theory, we can customize the string view comparison and make it crazy faster.

Closes #7790

Rationale for this change

In theory, we can customize the string view comparison and make it crazy faster.

What changes are included in this PR?

In theory, we can customize the string view comparison and make it crazy faster.

Are these changes tested?

Yes

Are there any user-facing changes?

No

zhuqi-lucas · 2025-06-26T13:58:48Z

With this PR, sorting string views is 1.5X ~ 3X faster:

critcmp  fast_sort_string_view main --filter "sort string_view"
group                                                   fast_sort_string_view                  main
-----                                                   ---------------------                  ----
sort string_view[0-400] nulls to indices 2^12           1.00     32.3±0.33µs        ? ?/sec    1.94     62.7±2.27µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00     55.9±0.48µs        ? ?/sec    2.17    121.5±2.48µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00     36.8±0.39µs        ? ?/sec    1.56     57.4±1.19µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00     62.8±0.79µs        ? ?/sec    1.69    106.3±1.14µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00     35.2±0.56µs        ? ?/sec    2.25     79.3±3.75µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     59.7±0.60µs        ? ?/sec    2.95   176.5±14.03µs        ? ?/sec

Dandandan · 2025-06-26T22:56:02Z

arrow-ord/src/sort.rs

+    let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
+
+    // Shift right by 32 bits to discard the zero padding (upper 4 bytes),
+    // so that the inline string occupies the high 96 bits


maybe makes sense to use a mask here instead of right / left shifting?

Thank you @Dandandan for the review. I tried using a mask in place of the right / left shifts, but the performance did not change; I believe the optimizer already does this optimization:

sort string_view[10] to indices 2^12
    time:   [62.344 µs 62.458 µs 62.603 µs]
    change: [−0.7065% −0.3295% +0.0918%] (p = 0.11 > 0.05)
    No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
    2 (2.00%) low mild
    5 (5.00%) high mild
    4 (4.00%) high severe

sort string_view[10] nulls to indices 2^12
    time:   [36.626 µs 36.673 µs 36.719 µs]
    change: [−0.4095% −0.1715% +0.0813%] (p = 0.17 > 0.05)
    No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
    3 (3.00%) low mild
    3 (3.00%) high mild

sort string_view[0-400] to indices 2^12
    time:   [55.812 µs 55.875 µs 55.939 µs]
    change: [−0.1893% +0.0245% +0.2323%] (p = 0.82 > 0.05)
    No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
    1 (1.00%) low severe
    1 (1.00%) low mild
    2 (2.00%) high mild
    1 (1.00%) high severe

sort string_view[0-400] nulls to indices 2^12
    time:   [32.269 µs 32.310 µs 32.351 µs]
    change: [−0.1251% +0.1156% +0.3441%] (p = 0.34 > 0.05)
    No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
    1 (1.00%) low mild
    1 (1.00%) high mild
    1 (1.00%) high severe

sort string_view_inlined[0-12] to indices 2^12
    time:   [59.737 µs 59.853 µs 60.040 µs]
    change: [+0.1045% +0.3772% +0.6674%] (p = 0.01 < 0.05)
    Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
    1 (1.00%) low mild
    3 (3.00%) high mild
    6 (6.00%) high severe

sort string_view_inlined[0-12] nulls to indices 2^12
    time:   [35.176 µs 35.237 µs 35.299 µs]
    change: [−0.0949% +0.2861% +0.6023%] (p = 0.12 > 0.05)
    No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
    1 (1.00%) high mild

This is the applied patch for testing:

diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 8796907669..2b0518c704 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -320,27 +320,37 @@ fn sort_bytes<T: ByteArrayType>(
 /// and the length in the lower 32 bits (big-endian).
 #[inline(always)]
 pub fn inline_key_fast(raw: u128) -> u128 {
-    // Convert the raw u128 (little-endian) into bytes for manipulation
+    // Interpret the 128-bit input as little-endian bytes
     let raw_bytes = raw.to_le_bytes();

-    // Extract the length (first 4 bytes), convert to big-endian u32, and promote to u128
+    // --- Step 1: Extract and convert the length field --- //
+    // The first 4 bytes (little-endian) encode the string length.
+    // Read them as u32 in little-endian order, then convert to big-endian
+    // and promote to u128 so it can occupy the low 32 bits of the key.
     let len_le = &raw_bytes[0..4];
-    let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as u128;
+    let len_be = u32::from_le_bytes(len_le.try_into().unwrap())
+        .to_be() as u128;

-    // Extract the inline string bytes (next 12 bytes), place them into the lower 12 bytes of a 16-byte array,
-    // padding the upper 4 bytes with zero to form a little-endian u128 value
+    // --- Step 2: Extract the inlined string bytes --- //
+    // Bytes 4..16 contain up to 12 bytes of the inline string.
+    // Copy them into bytes 4..16 of a 16-byte buffer, leaving top 4 bytes zero.
     let mut inline_bytes = [0u8; 16];
     inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]);

-    // Convert to big-endian to ensure correct lexical ordering
-    let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
+    // Convert that buffer to a big-endian u128 so that
+    // byte-wise lexical order corresponds to numeric order.
+    let inline_be = u128::from_le_bytes(inline_bytes).to_be();

-    // Shift right by 32 bits to discard the zero padding (upper 4 bytes),
-    // so that the inline string occupies the high 96 bits
-    let inline_part = inline_u128 >> 32;
+    // --- Step 3: Mask out the padding and isolate the inline portion --- //
+    // Define a mask that selects bits 32..128 (the 12 inlined bytes in BE)
+    // and zeroes out the high and low 32 bits.
+    const INLINE_MASK: u128 = 0x00FF_FFFF_FFFF_FFFF_FFFF_FFFF_0000_0000u128;
+    let inline_part = inline_be & INLINE_MASK;

-    // Combine the inline string part (high 96 bits) and length (low 32 bits) into the final key
-    (inline_part << 32) | len_be
+    // --- Step 4: Combine masked inline portion with length --- //
+    // The inline bytes already occupy bits 32..128 after masking.
+    // OR in the big-endian length (bits 0..32) to form the final key.
+    inline_part | len_be
 }

 fn sort_byte_view<T: ByteViewType>(

The code is copied from this PR; once one of them is merged, I can reuse the code:

https://github.com/apache/arrow-rs/pull/7748/files#diff-160ecd8082d5d28081f01cdb08a898cb8f49b17149c7118bf96746ddaae24b4fR592

Ah I missed that it was reused, let's merge that one first then and reuse it here!
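For readers following the mask-versus-shift question in this thread: shifting a `u128` right and then left by 32 bits clears its low 32 bits, which is the same as AND-ing with a mask whose low 32 bits are zero. The sketch below only demonstrates that general identity; the helper names are made up for illustration and it is not a restatement of the test patch above. It is one plausible reason an optimizer could treat the two forms alike.

```rust
/// Clear the low 32 bits by shifting right, then left (hypothetical helper).
fn clear_low32_shift(x: u128) -> u128 {
    (x >> 32) << 32
}

/// Clear the low 32 bits with a single AND (hypothetical helper).
fn clear_low32_mask(x: u128) -> u128 {
    // All bits set except the low 32.
    const HIGH_96_MASK: u128 = !0xFFFF_FFFFu128;
    x & HIGH_96_MASK
}

fn main() {
    let samples = [
        0u128,
        1,
        u128::MAX,
        0x0123_4567_89AB_CDEF_0011_2233_4455_6677,
    ];
    for x in samples {
        // Both forms must produce the same value for every input.
        assert_eq!(clear_low32_shift(x), clear_low32_mask(x));
    }
    println!("shift and mask agree on all samples");
}
```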

alamb

Thanks @zhuqi-lucas and @Dandandan -- this looks pretty cool. I think I must be missing something about how this u128 is constructed -- I left some comments. Let me know what you think

alamb · 2025-06-27T13:29:45Z

arrow-ord/src/sort.rs

+        }
+
+        // 3.2 Compare 4-byte prefix in big-endian order
+        let pref_a = ByteView::from(raw_a).prefix.swap_bytes();


I think ByteView can only be used if the view len is greater than 12:

arrow-rs/arrow-data/src/byte_view.rs

Lines 29 to 30 in 10d9714

/// Helper to access views of [`GenericByteViewArray`] (`StringViewArray` and

/// `BinaryViewArray`) where the length is greater than 12 bytes.

Isn't this potentially comparing an inline view and a non-inline view?

Thank you @alamb for the review. This is actually copied from another PR:
#7748

I added the comments to it here. I can also change it to a raw conversion instead of using ByteView::from, but I think it is safe because we only use the prefix here, and both inlined and non-inlined views have the prefix.

https://github.com/apache/arrow-rs/pull/7748/files#diff-160ecd8082d5d28081f01cdb08a898cb8f49b17149c7118bf96746ddaae24b4fR560

Maybe we can get the other PR ready to merge first; then I can just copy the same code from there. Thanks a lot!

alamb · 2025-06-27T13:30:12Z

arrow-ord/src/sort.rs

+            return pref_a.cmp(&pref_b);
+        }
+
+        // 3.3 Fallback to full byte-slice comparison


Isn't this only valid if both views are not inlined (len > 12)? I am not sure it is ok to lexicographically compare an inlined view and a non-inlined view 🤔 Won't that potentially compare the buffer offset to part of the strings?

Update: I double checked and value_unchecked correctly handles short/long views

Though it makes me wonder if we could squeeze more out of this function by handling the three cases explicitly (short, long), (long, short) and (long, long)

However, that might have a minimal payback
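As a rough illustration of the prefix-then-fallback flow discussed in this thread, here is a self-contained sketch. It assumes the view layout from the StringView format (length in bits 0..32, first four data bytes in bits 32..64 for both short and long views). `make_view`, `prefix_be` and `compare_views` are hypothetical names, not arrow-rs APIs, and the full byte slices are passed in directly instead of being read from data buffers.

```rust
use std::cmp::Ordering;

/// Build a raw 128-bit view for `value` (hypothetical helper for this sketch).
/// Long values keep only their 4-byte prefix; buffer index and offset stay 0.
fn make_view(value: &[u8]) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    let stored = if value.len() <= 12 { value.len() } else { 4 };
    bytes[4..4 + stored].copy_from_slice(&value[..stored]);
    u128::from_le_bytes(bytes)
}

/// Bits 32..64 hold the first four data bytes for short and long views alike.
/// Swapping the bytes makes integer order match byte-wise order.
fn prefix_be(raw: u128) -> u32 {
    ((raw >> 32) as u32).swap_bytes()
}

/// Prefix-first comparison with a fallback to the full byte slices.
fn compare_views(raw_a: u128, full_a: &[u8], raw_b: u128, full_b: &[u8]) -> Ordering {
    let (pref_a, pref_b) = (prefix_be(raw_a), prefix_be(raw_b));
    if pref_a != pref_b {
        return pref_a.cmp(&pref_b);
    }
    full_a.cmp(full_b)
}

fn main() {
    // A mix of short (inline) and long (len > 12) values.
    let samples: [&[u8]; 6] = [
        b"",
        b"app",
        b"apple",
        b"a-much-longer-string",
        b"a-much-longer-string-value",
        b"zebra-crossing-ahead",
    ];
    for a in samples {
        for b in samples {
            let got = compare_views(make_view(a), a, make_view(b), b);
            assert_eq!(got, a.cmp(b));
        }
    }
    println!("prefix-first comparison matches slice ordering");
}
```

Handling the (short, long), (long, short) and (long, long) cases explicitly would replace the single fallback with per-case code; the sketch keeps one fallback because the prefix check is the same in every case.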

alamb · 2025-06-27T13:33:15Z

arrow-ord/src/sort.rs

+/// and packs them into a single u128 value suitable for fast comparisons.
+///
+/// # Note
+/// The input `raw` is assumed to be in little-endian format with the following layout:


here is some potentially useful ascii art:

                     StringView Format

┌───────────────────────────────────────────┬──────────────┐
│                   data                    │    length    │  Strings, len <= 12
│             (padded with \0)              │    (u32)     │  (Inline)
│                                           │              │
└───────────────────────────────────────────┴──────────────┘
 127                                       31             0   bit offset

┌──────────────┬─────────────┬──────────────┬──────────────┐
│buffer offset │ buffer index│ data prefix  │    length    │  Strings, len > 12
│    (u32)     │    (u32)    │  (4 bytes)   │    (u32)     │  (Offset)
│              │             │              │              │
└──────────────┴─────────────┴──────────────┴──────────────┘
 127          95            63             31             0   bit offset

And the raw monodraw file:
stringview.zip
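To make the layout concrete, here is a small sketch that splits a raw little-endian `u128` view into the fields named in the diagram. The `DecodedView` enum and `decode_view` function are illustrative names only; they are not arrow-rs types.

```rust
use std::convert::TryInto;

/// Fields of a view, following the layout in the diagram.
/// Illustrative only; this is not an arrow-rs type.
#[derive(Debug, PartialEq)]
enum DecodedView {
    Inline { len: u32, data: [u8; 12] },
    Offset { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

fn decode_view(raw: u128) -> DecodedView {
    // Byte 0 holds bits 0..8, so the length is in bytes 0..4.
    let bytes = raw.to_le_bytes();
    let len = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    if len <= 12 {
        // Inline: the remaining 12 bytes are the zero-padded data.
        let mut data = [0u8; 12];
        data.copy_from_slice(&bytes[4..16]);
        DecodedView::Inline { len, data }
    } else {
        // Offset: prefix, then buffer index, then buffer offset.
        DecodedView::Offset {
            len,
            prefix: bytes[4..8].try_into().unwrap(),
            buffer_index: u32::from_le_bytes(bytes[8..12].try_into().unwrap()),
            offset: u32::from_le_bytes(bytes[12..16].try_into().unwrap()),
        }
    }
}

fn main() {
    // Build an inline view for "hi": length in bits 0..32, data from bit 32.
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&2u32.to_le_bytes());
    bytes[4..6].copy_from_slice(b"hi");
    println!("{:?}", decode_view(u128::from_le_bytes(bytes)));
}
```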

alamb · 2025-06-27T13:33:49Z

arrow-ord/src/sort.rs

    sort_impl(options, &mut valids, &nulls, limit, Ord::cmp).into()
 }

+/// Builds a 128-bit composite key for an inline value:


This comment does a good job explaining what this function does, but I think it would be good if we could explain why it is useful a bit more specifically than "fast comparisons". Specifically, what property(s) the returned u128 has

Something like this

Suggested change

-/// Builds a 128-bit composite key for an inline value:
+/// Builds a 128-bit composite key for an inline value for fast sorting
+///
+/// The `u128` returned by this function compares the same as comparing
+/// the `str` value from an inlined view.

I am not quite sure if that is correct, because if it is, I don't understand why it is copying the length bytes as well 🤔

Thank you @alamb for the review, good suggestion!

The idea comes from the comments here:

#7748 (comment)

Maybe we can get the base PR ready first. Thanks a lot!

yes, sorry -- I missed the other one -- let's work on #7748 first
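On the question of why the length bytes are part of the key, one possible reading is that inline data is zero-padded, so two different strings such as "a" and "a\0" have identical data bytes and only the length can order them. The sketch below follows the description in the doc comment (data in the upper 96 bits, length in the lower 32 bits, both big-endian) but is written independently; `make_inline_view` and `inline_key_sketch` are hypothetical names and this is not claimed to be the code of `inline_key_fast`.

```rust
/// Build a raw inline view from at most 12 bytes (hypothetical helper).
fn make_inline_view(value: &[u8]) -> u128 {
    assert!(value.len() <= 12, "only inline values are handled here");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    bytes[4..4 + value.len()].copy_from_slice(value);
    u128::from_le_bytes(bytes)
}

/// Sketch of a sort key: the 12 data bytes fill the high 96 bits in
/// big-endian order and the length fills the low 32 bits.
fn inline_key_sketch(raw: u128) -> u128 {
    let bytes = raw.to_le_bytes();
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let mut key = [0u8; 16];
    key[0..12].copy_from_slice(&bytes[4..16]);
    key[12..16].copy_from_slice(&len.to_be_bytes());
    u128::from_be_bytes(key)
}

fn main() {
    // "a" and "a\0" have identical zero-padded data, so only the length differs.
    let (short, long) = (make_inline_view(b"a"), make_inline_view(b"a\0"));
    assert_eq!(short >> 32, long >> 32);
    assert!(inline_key_sketch(short) < inline_key_sketch(long));

    // Key order should agree with byte-slice order for every pair.
    let samples: [&[u8]; 7] = [b"", b"a", b"a\0", b"ab", b"b", b"abcdefghijk", b"abcdefghijkl"];
    for a in samples {
        for b in samples {
            let key_a = inline_key_sketch(make_inline_view(a));
            let key_b = inline_key_sketch(make_inline_view(b));
            assert_eq!(key_a.cmp(&key_b), a.cmp(b));
        }
    }
    println!("key order matches byte-slice order");
}
```

Without the length in the low bits, the keys for "a" and "a\0" would be equal even though the values are not.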

alamb · 2025-06-27T13:35:11Z

arrow-ord/src/sort.rs

+/// The output u128 key places the inline string data in the upper 96 bits (big-endian)
+/// and the length in the lower 32 bits (big-endian).
+#[inline(always)]
+pub fn inline_key_fast(raw: u128) -> u128 {


Before merging this PR, I think we should write some unit tests showing that the output u128 of this function does indeed compare the same as the corresponding &str representations (assuming I am not missing something)

I see now you did this in #7748 -- I will continue the conversation there

Thank you @alamb, I added some tests in the other PR, which this PR is based on:

#7748

Maybe we can get it ready first. Thanks a lot!

zhuqi-lucas · 2025-06-30T11:00:24Z

Thank you @alamb for the review. The PR has now been updated on top of the merged PR, which contains the new API for the fast path.

alamb · 2025-07-01T19:00:42Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_sort_string_view (d6814a7) to a9f316b diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_sort_string_view
Results will be posted here when complete

alamb · 2025-07-01T19:15:40Z

🤖: Benchmark completed

Details

group                                                   fast_sort_string_view                  main
-----                                                   ---------------------                  ----
lexsort (bool, bool) 2^12                               1.00    117.2±0.58µs        ? ?/sec    1.00    116.8±0.36µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    156.5±0.30µs        ? ?/sec    1.00    156.0±0.26µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.01     45.3±0.09µs        ? ?/sec    1.00     45.0±0.08µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    211.8±0.35µs        ? ?/sec    1.00    212.6±0.33µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.0±0.05µs        ? ?/sec    1.00     38.1±0.08µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     40.5±0.07µs        ? ?/sec    1.00     40.4±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.4±0.16µs        ? ?/sec    1.01     79.4±0.13µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    211.8±0.29µs        ? ?/sec    1.00    212.5±0.50µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     53.5±0.11µs        ? ?/sec    1.00     52.8±0.17µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    252.2±0.50µs        ? ?/sec    1.00    248.7±0.66µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     85.0±0.18µs        ? ?/sec    1.00     85.0±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     86.0±0.20µs        ? ?/sec    1.00     86.0±0.19µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00     95.8±0.15µs        ? ?/sec    1.00     95.8±0.28µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    252.5±0.41µs        ? ?/sec    1.00    249.3±0.66µs        ? ?/sec
rank f32 2^12                                           1.00     68.3±0.17µs        ? ?/sec    1.02     69.4±1.68µs        ? ?/sec
rank f32 nulls 2^12                                     1.00     37.7±0.05µs        ? ?/sec    1.02     38.2±0.26µs        ? ?/sec
rank string[10] 2^12                                    1.00    234.4±0.62µs        ? ?/sec    1.00    233.7±0.40µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    115.8±0.35µs        ? ?/sec    1.00    115.7±0.26µs        ? ?/sec
sort f32 2^12                                           1.04     64.8±0.25µs        ? ?/sec    1.00     62.4±0.45µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     32.1±0.12µs        ? ?/sec    1.00     32.2±0.09µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     70.5±0.22µs        ? ?/sec    1.00     70.3±0.25µs        ? ?/sec
sort f32 to indices 2^12                                1.00     77.4±0.26µs        ? ?/sec    1.00     77.0±0.31µs        ? ?/sec
sort i32 2^10                                           1.01      8.5±0.02µs        ? ?/sec    1.00      8.5±0.02µs        ? ?/sec
sort i32 2^12                                           1.01     42.5±0.24µs        ? ?/sec    1.00     42.1±0.16µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      5.4±0.02µs        ? ?/sec    1.04      5.6±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     22.9±0.09µs        ? ?/sec    1.06     24.2±0.04µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.00     12.3±0.05µs        ? ?/sec    1.01     12.5±0.03µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     52.8±0.13µs        ? ?/sec    1.00     52.9±0.15µs        ? ?/sec
sort i32 to indices 2^10                                1.00     11.4±0.02µs        ? ?/sec    1.00     11.5±0.02µs        ? ?/sec
sort i32 to indices 2^12                                1.00     55.3±0.17µs        ? ?/sec    1.00     55.3±0.17µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.4±0.01µs        ? ?/sec    1.09      7.0±0.02µs        ? ?/sec
sort primitive run to indices 2^12                      1.04      9.0±0.01µs        ? ?/sec    1.00      8.6±0.02µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    170.0±0.65µs        ? ?/sec    1.00    170.3±1.73µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    299.8±0.73µs        ? ?/sec    1.01    301.5±0.72µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.01    137.2±0.34µs        ? ?/sec    1.00    136.2±0.26µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    230.6±0.30µs        ? ?/sec    1.00    229.6±0.30µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     82.8±0.18µs        ? ?/sec    1.78    147.8±0.54µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    121.8±0.32µs        ? ?/sec    2.21    269.1±0.56µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00    108.9±0.24µs        ? ?/sec    1.28    139.7±0.43µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    173.0±0.38µs        ? ?/sec    1.34    231.2±0.34µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00    103.7±0.30µs        ? ?/sec    1.40    145.6±0.33µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00    163.4±0.42µs        ? ?/sec    1.51    247.1±0.62µs        ? ?/sec

alamb · 2025-07-01T19:20:44Z

Those are some impressive numbers.

group                                                   fast_sort_string_view                  main
-----                                                   ---------------------                  ----
sort string_view[0-400] nulls to indices 2^12           1.00     82.8±0.18µs        ? ?/sec    1.78    147.8±0.54µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    121.8±0.32µs        ? ?/sec    2.21    269.1±0.56µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00    108.9±0.24µs        ? ?/sec    1.28    139.7±0.43µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    173.0±0.38µs        ? ?/sec    1.34    231.2±0.34µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00    103.7±0.30µs        ? ?/sec    1.40    145.6±0.33µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00    163.4±0.42µs        ? ?/sec    1.51    247.1±0.62µs        ? ?/sec

I am now reviewing this PR in more detail
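For anyone reading the tables, the ratio columns appear to be each timing divided by the fastest timing in the same row, so the faster side shows 1.00. A minimal sketch of that arithmetic, using two rows copied from the table above (the `relative` helper is hypothetical and not part of critcmp):

```rust
/// Divide every timing by the fastest one in the row (hypothetical helper).
fn relative(times_us: &[f64]) -> Vec<f64> {
    let fastest = times_us.iter().cloned().fold(f64::INFINITY, f64::min);
    times_us.iter().map(|t| t / fastest).collect()
}

fn main() {
    // (benchmark, fast_sort_string_view µs, main µs), copied from the table.
    let rows = [
        ("sort string_view[0-400] to indices 2^12", 121.8, 269.1),
        ("sort string_view[10] nulls to indices 2^12", 108.9, 139.7),
    ];
    for (name, branch_us, main_us) in rows {
        let r = relative(&[branch_us, main_us]);
        println!("{}: {:.2} vs {:.2}", name, r[0], r[1]);
    }
}
```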

adriangb · 2025-07-01T19:24:01Z

This is amazing - super exciting!

alamb

Thank you @zhuqi-lucas -- I went over this PR again carefully and I think it is really nice and quite clever

FYI @Dandandan and @XiangpengHao 🚀

alamb · 2025-07-01T19:37:01Z

arrow-ord/src/sort.rs

+            return pref_a.cmp(&pref_b);
+        }
+
+        // 3.3 Fallback to full byte-slice comparison


Update: I double checked and value_unchecked correctly handles short/long views

Though it makes me wonder if we could squeeze more out of this function by handling the three cases explicitly (short, long), (long, short) and (long, long)

However, that might have a minimal payback

alamb · 2025-07-01T19:38:52Z

and make it crazy faster.

This is a great description

Dandandan

🚀 🚀 🚀

Dandandan · 2025-07-01T19:51:06Z

CRAZY STUFF!!!

Dandandan · 2025-07-01T19:53:12Z

Very impressive @zhuqi-lucas you are taking it this far.
Excited to see the results compounded in the benchmarks.

zhuqi-lucas · 2025-07-02T02:23:13Z

Thank you @alamb @Dandandan @adriangb for review!

alamb · 2025-07-02T19:49:28Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_sort_string_view (d6814a7) to a9f316b diff
BENCH_NAME=sort_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_sort_string_view
Results will be posted here when complete

alamb · 2025-07-03T11:05:45Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_sort_string_view (d6814a7) to a9f316b diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_sort_string_view
Results will be posted here when complete

alamb · 2025-07-03T11:20:45Z

🤖: Benchmark completed

Details

group                                                   fast_sort_string_view                  main
-----                                                   ---------------------                  ----
lexsort (bool, bool) 2^12                               1.01    117.4±1.95µs        ? ?/sec    1.00    115.9±0.30µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.00    156.5±0.29µs        ? ?/sec    1.00    156.3±0.21µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.1±0.07µs        ? ?/sec    1.00     45.2±0.06µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    212.4±0.27µs        ? ?/sec    1.00    211.5±0.21µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.1±0.07µs        ? ?/sec    1.01     38.5±0.10µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     40.8±0.09µs        ? ?/sec    1.04     42.6±0.05µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.4±0.17µs        ? ?/sec    1.03     80.6±3.26µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    211.6±0.30µs        ? ?/sec    1.00    212.0±0.43µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.01     53.7±0.18µs        ? ?/sec    1.00     53.0±0.21µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.01    252.1±0.36µs        ? ?/sec    1.00    249.3±0.45µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.01     85.2±0.27µs        ? ?/sec    1.00     84.7±0.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.01     86.2±0.31µs        ? ?/sec    1.00     85.6±0.16µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00     96.0±0.31µs        ? ?/sec    1.00     95.6±0.30µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.01    252.1±0.41µs        ? ?/sec    1.00    248.9±0.61µs        ? ?/sec
rank f32 2^12                                           1.00     68.2±0.27µs        ? ?/sec    1.01     69.1±0.18µs        ? ?/sec
rank f32 nulls 2^12                                     1.00     37.6±0.10µs        ? ?/sec    1.01     38.1±0.04µs        ? ?/sec
rank string[10] 2^12                                    1.00    233.4±0.46µs        ? ?/sec    1.01    234.7±0.31µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    115.2±0.28µs        ? ?/sec    1.01    115.8±0.25µs        ? ?/sec
sort f32 2^12                                           1.04     64.8±0.21µs        ? ?/sec    1.00     62.2±0.84µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     32.1±0.10µs        ? ?/sec    1.00     32.2±0.11µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.01     70.9±0.18µs        ? ?/sec    1.00     70.3±0.16µs        ? ?/sec
sort f32 to indices 2^12                                1.01     77.5±0.27µs        ? ?/sec    1.00     77.0±0.26µs        ? ?/sec
sort i32 2^10                                           1.01      8.5±0.01µs        ? ?/sec    1.00      8.5±0.01µs        ? ?/sec
sort i32 2^12                                           1.01     42.4±0.20µs        ? ?/sec    1.00     42.1±0.58µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      5.4±0.01µs        ? ?/sec    1.04      5.6±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     22.9±0.07µs        ? ?/sec    1.06     24.2±0.07µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.00     12.3±0.02µs        ? ?/sec    1.02     12.5±0.03µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     52.7±0.12µs        ? ?/sec    1.01     53.0±0.24µs        ? ?/sec
sort i32 to indices 2^10                                1.00     11.4±0.02µs        ? ?/sec    1.01     11.5±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.00     55.3±0.12µs        ? ?/sec    1.01     55.7±0.12µs        ? ?/sec
sort primitive run 2^12                                 1.00      6.3±0.01µs        ? ?/sec    1.13      7.1±0.02µs        ? ?/sec
sort primitive run to indices 2^12                      1.04      9.0±0.02µs        ? ?/sec    1.00      8.7±0.01µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    169.3±0.30µs        ? ?/sec    1.00    170.0±0.34µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    298.4±0.42µs        ? ?/sec    1.00    298.7±0.63µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.01    136.9±0.30µs        ? ?/sec    1.00    135.9±0.28µs        ? ?/sec
sort string[10] to indices 2^12                         1.00    229.7±0.29µs        ? ?/sec    1.00    230.2±0.29µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.00     83.0±0.17µs        ? ?/sec    1.78    147.9±0.33µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    122.1±0.27µs        ? ?/sec    2.22    270.8±0.54µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.00    108.8±0.31µs        ? ?/sec    1.28    139.7±0.29µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    173.1±0.83µs        ? ?/sec    1.34    231.3±0.33µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.00    103.6±0.32µs        ? ?/sec    1.40    145.5±0.23µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00    163.1±0.38µs        ? ?/sec    1.51    246.8±0.72µs        ? ?/sec

zhuqi-lucas added 2 commits June 26, 2025 21:36

Init version for fast sort string view

d724386

make sort crazy fast for string view

e1ccab4

github-actions bot added the arrow Changes to the arrow crate label Jun 26, 2025

zhuqi-lucas mentioned this pull request Jun 26, 2025

Perf: optimize sort string_view performance #7790

Closed

Dandandan reviewed Jun 26, 2025

alamb reviewed Jun 27, 2025

alamb mentioned this pull request Jun 27, 2025

Perf: Add prefix compare for inlined compare and change use of inline_value to inline it to a u128 #7748

Merged

zhuqi-lucas added 3 commits June 30, 2025 11:40

Remove unused

5000e8a

Merge remote-tracking branch 'upstream/main' into fast_sort_string_view

463ff19

update logic

d6814a7

alamb approved these changes Jul 1, 2025

alamb added the performance label Jul 1, 2025

Dandandan approved these changes Jul 1, 2025

Dandandan merged commit 52ad7d7 into apache:main Jul 1, 2025
17 checks passed

zhuqi-lucas mentioned this pull request Jul 2, 2025

Improve StringArray(Utf8) sort performance #7847

Closed

zhuqi-lucas mentioned this pull request Jul 6, 2025

fix: Incorrect inlined string view comparison after Add prefix compar… #7875

Merged
