fix: Incorrect inlined string view comparison after Add prefix compar… #7875

zhuqi-lucas · 2025-07-05T15:15:51Z

…e for inlined

Which issue does this PR close?

Closes #7874

Rationale for this change

Change Summary

Rework inline_key_fast to avoid reversing the inline data bytes by removing the global .to_be() on the entire 128‑bit word and instead manually constructing the big‑endian key in two parts: the 96‑bit data portion and the 32‑bit length tiebreaker.

Problem

In the original implementation:

let inline_u128 = u128::from_le_bytes(raw_bytes).to_be();

What went wrong: Calling .to_be() on the full 16‑byte value flips all bytes, including the 12 bytes of inline data.
Consequences: Multi‑byte strings are compared in reverse order — e.g. "backend one" would sort as if it were "eno dnekcab" — so lexicographical ordering is completely inverted.
Corner cases exposed:
“backend one” vs. “backend two”: suffixes “one”/“two” compare incorrectly once reversed.
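As a rough illustration of the reversal described above, the sketch below packs "backend one" into a 16-byte view (length first, then inline data) and swaps all bytes. It uses `swap_bytes`, which is what `.to_be()` does on a little-endian target, so the result is the same on any platform. The helper name `reversed_view` is hypothetical and is not part of arrow-rs.

```rust
/// Hypothetical helper (illustration only): decodes the 16 raw bytes as a
/// little-endian u128, swaps every byte, and returns the resulting bytes.
/// On a little-endian target this matches what `.to_be()` would produce.
fn reversed_view(raw_bytes: [u8; 16]) -> [u8; 16] {
    u128::from_le_bytes(raw_bytes).swap_bytes().to_le_bytes()
}

fn main() {
    // Assumed view layout: bytes 0..4 = length (LE), bytes 4..16 = inline data.
    let mut raw_bytes = [0u8; 16];
    raw_bytes[0..4].copy_from_slice(&11u32.to_le_bytes());
    raw_bytes[4..15].copy_from_slice(b"backend one");

    let swapped = reversed_view(raw_bytes);

    // Every byte is mirrored, including the inline string data.
    let mut expected = raw_bytes;
    expected.reverse();
    assert_eq!(swapped, expected);

    println!("{}", String::from_utf8_lossy(&swapped[1..12]));
}
```

The inline data ends up mirrored along with the length field, which is why a whole-word swap cannot be used to build the comparison key.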

Solution

#[inline(always)]
    pub fn inline_key_fast(raw: u128) -> u128 {
        // 1. Decompose `raw` into little‑endian bytes:
        //    - raw_bytes[0..4]  = length in LE
        //    - raw_bytes[4..16] = inline string data
        let raw_bytes = raw.to_le_bytes();

        // 2. Numerically truncate to get the low 32‑bit length (endianness‑free).
        let length = raw as u32;

        // 3. Build a 16‑byte buffer in big‑endian order:
        //    - buf[0..12]  = inline string bytes (in original order)
        //    - buf[12..16] = length.to_be_bytes() (BE)
        let mut buf = [0u8; 16];
        buf[0..12].copy_from_slice(&raw_bytes[4..16]); // inline data

        // Why convert length to big-endian for comparison?
        //
        // Rust (on most platforms) stores integers in little-endian format,
        // meaning the least significant byte is at the lowest memory address.
        // For example, a u32 value like 0x22345677 is stored in memory as:
        //
        //   [0x77, 0x56, 0x34, 0x22]  // little-endian layout
        //    ^     ^     ^     ^
        //  LSB   ↑↑↑           MSB
        //
        // This layout is efficient for arithmetic but *not* suitable for
        // lexicographic (dictionary-style) comparison of byte arrays.
        //
        // To compare values by byte order—e.g., for sorted keys or binary trees—
        // we must convert them to **big-endian**, where:
        //
        //   - The most significant byte (MSB) comes first (index 0)
        //   - The least significant byte (LSB) comes last (index N-1)
        //
        // In big-endian, the same u32 = 0x22345677 would be represented as:
        //
        //   [0x22, 0x34, 0x56, 0x77]
        //
        // This ordering aligns with natural string/byte sorting, so calling
        // `.to_be_bytes()` allows us to construct
        // keys where standard numeric comparison (e.g., `<`, `>`) behaves
        // like lexicographic byte comparison.
        buf[12..16].copy_from_slice(&length.to_be_bytes()); // length in BE

        // 4. Deserialize the buffer as a big‑endian u128:
        //    buf[0] is MSB, buf[15] is LSB.
        // Note on endianness and layout:
        //
        // Although `buf[0]` is stored at the lowest memory address,
        // calling `u128::from_be_bytes(buf)` interprets it as the **most significant byte (MSB)**,
        // and `buf[15]` as the **least significant byte (LSB)**.
        //
        // This is the core principle of **big-endian decoding**:
        //   - Byte at index 0 maps to bits 127..120 (highest)
        //   - Byte at index 1 maps to bits 119..112
        //   - ...
        //   - Byte at index 15 maps to bits 7..0 (lowest)
        //
        // So even though memory layout goes from low to high (left to right),
        // big-endian treats the **first byte** as highest in value.
        //
        // This guarantees that comparing two `u128` keys is equivalent to lexicographically
        // comparing the original inline bytes, followed by length.
        u128::from_be_bytes(buf)
    }
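To see the key in action, here is a minimal, self-contained sketch. It repeats the body of `inline_key_fast` from above (without the comments) so that it compiles on its own, and adds two hypothetical helpers, `make_inline_view` and `key`, which are assumptions for illustration and not arrow-rs APIs. The view layout (4-byte little-endian length followed by up to 12 zero-padded data bytes) is taken from the comments in the function above.

```rust
/// Hypothetical helper (not part of arrow-rs): packs a string of at most
/// 12 bytes into the raw view layout: length in LE, then zero-padded data.
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only inline strings are supported");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Same logic as the `inline_key_fast` shown above.
fn inline_key_fast(raw: u128) -> u128 {
    let raw_bytes = raw.to_le_bytes();
    let length = raw as u32; // low 32 bits hold the length
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]); // data, original order
    buf[12..16].copy_from_slice(&length.to_be_bytes()); // length tiebreaker
    u128::from_be_bytes(buf)
}

/// Hypothetical convenience wrapper used only in this sketch.
fn key(data: &[u8]) -> u128 {
    inline_key_fast(make_inline_view(data))
}

fn main() {
    // Length breaks the tie when the padded data is identical.
    assert!(key(b"bar") < key(b"bar\0"));
    // The data portion is compared in its original byte order.
    assert!(key(b"backend one") < key(b"backend two"));

    // Sorting by key should agree with plain byte-wise string ordering.
    let mut by_key = vec!["xyz", "", "bar\0", "backend two", "xyy", "bar", "backend one"];
    let mut by_bytes = by_key.clone();
    by_key.sort_by_key(|s| key(s.as_bytes()));
    by_bytes.sort();
    assert_eq!(by_key, by_bytes);
}
```

The zero padding is what makes the length tiebreaker necessary: "bar" and "bar\0" have identical padded data, so only the trailing length distinguishes them.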

Testing

All existing tests — including the “backend one” vs. “backend two” and "bar" vs. "bar\0" cases — now pass, confirming both lexicographical correctness and proper length‑based tiebreaking.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

zhuqi-lucas · 2025-07-05T15:36:37Z

cc @alamb @Dandandan

alamb

Thank you @zhuqi-lucas -- I verified this fixes the issue I was seeing in DataFusion 🙏

Given the subtlety here, I think a few more tests are warranted -- I left some more suggestions. Let's try and make sure we are 100% sure about this one

alamb · 2025-07-05T19:46:42Z

arrow-array/src/array/byte_view_array.rs

+            // This pair verifies that we didn’t accidentally reverse the inline bytes:
+            // without our fix, “backend one” would compare as if it were
+            //    “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+            b"backend one", // special case: tests byte-order reversal bug


I wonder why this wasn't caught with the xyy and xyz test case?

Good question. The reason:

The bug caused full byte reversal of the inline string bytes, meaning the entire 12-byte segment was reversed before comparison.

For strings like "xyy" and "xyz", which differ only in their last byte, reversing the bytes moves this difference to the first byte of the reversed string.

Since both strings are reversed before they are compared, their relative order is preserved.

Thus, even though the byte order is wrong globally (the entire string is reversed), "xyy" still compares correctly as less than "xyz" in the reversed space, so the test passes.

In other words, differences at the end of short strings don’t expose the reversal bug, because reversing the entire string simply moves the difference to the front, preserving the relative order.

The bug only becomes apparent in strings with differences in the middle or earlier bytes, like "backend one" vs "backend two", where reversing the entire inline data inverts the lexicographical order unexpectedly.

I guess because "xyy" < "xyz" and "yyx" < "zyx"?

Right!
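The point about "xyy" and "xyz" can be checked with a small sketch. The `reversed` helper is hypothetical, and the second pair ("ab" and "ba") is an assumed example chosen for illustration; it is not one of the PR's test cases.

```rust
/// Returns the bytes of `s` in reverse order (illustration only).
fn reversed(s: &[u8]) -> Vec<u8> {
    s.iter().rev().copied().collect()
}

fn main() {
    // These differ only in the last byte, so the order survives reversal.
    assert!(b"xyy" < b"xyz");
    assert!(reversed(b"xyy") < reversed(b"xyz")); // "yyx" < "zyx"

    // An assumed pair where the first difference from the front and the
    // first difference from the back disagree, so reversal flips the order.
    assert!(b"ab" < b"ba");
    assert!(reversed(b"ab") > reversed(b"ba")); // "ba" > "ab"
}
```

A test pair only detects a reversal bug when comparing from the back gives a different answer than comparing from the front.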

alamb · 2025-07-05T19:50:06Z

arrow-array/src/array/byte_view_array.rs

+            // without our fix, “backend one” would compare as if it were
+            //    “eno dnekcab”, so “one” might end up sorting _after_ “two”.
+            b"backend one", // special case: tests byte-order reversal bug
+            b"backend two",


Can we also please include strings that have exactly the same inline prefix but differ in length as well as content?

Something like

LongerThan12Bytes

LongerThan12Bytez

LongerThan12Bytes\0

LongerThan12Byt

Also, given the endian swapping going on, can we please also include a few strings that are more than 256 bytes long (so the length requires 2 bytes to store)?
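A short sketch of why lengths that need two bytes are a useful probe for endian mistakes: once a length crosses 255, comparing its little-endian bytes no longer agrees with numeric order, while comparing its big-endian bytes still does. The two helper functions are hypothetical names introduced for this example.

```rust
/// True if comparing the little-endian bytes of `a` and `b` as byte arrays
/// gives the same result as comparing the numbers themselves.
fn le_bytes_order_matches(a: u32, b: u32) -> bool {
    a.to_le_bytes().cmp(&b.to_le_bytes()) == a.cmp(&b)
}

/// Same check for the big-endian byte representation.
fn be_bytes_order_matches(a: u32, b: u32) -> bool {
    a.to_be_bytes().cmp(&b.to_be_bytes()) == a.cmp(&b)
}

fn main() {
    // 255 = [0xFF, 0, 0, 0] in LE; 256 = [0, 1, 0, 0] in LE.
    // Byte-wise, the LE form of 256 sorts before the LE form of 255.
    assert!(!le_bytes_order_matches(255, 256));

    // In BE, 255 = [0, 0, 0, 0xFF] and 256 = [0, 0, 1, 0]: order is kept.
    assert!(be_bytes_order_matches(255, 256));

    // Single-byte lengths do not expose the difference.
    assert!(le_bytes_order_matches(3, 4));
}
```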

Thank you @alamb, good point:

b"than12Byt", b"than12Bytes", b"than12Bytes\0", b"than12Bytez",

I will add the cases above, because the inline_key_fast function is only used when both strings being compared are <= 12 bytes long.

Added the test cases above in the latest change.
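For reference, the ordering that the key comparison is expected to reproduce for these four cases can be stated with plain byte-slice comparison. This is only a sketch of the expected result, not the test that was added to the PR; `is_strictly_ascending` is a hypothetical helper.

```rust
/// True if every element is strictly less than the next one under
/// ordinary lexicographic byte comparison.
fn is_strictly_ascending(items: &[&[u8]]) -> bool {
    items.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    // Same inline prefix, differing in length and in the final byte.
    let cases: [&[u8]; 4] = [
        b"than12Byt",
        b"than12Bytes",
        b"than12Bytes\0",
        b"than12Bytez",
    ];
    assert!(is_strictly_ascending(&cases));
}
```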

Thank you @alamb, I also found a way to do it: I adjusted the fuzz testing to reproduce these cases, and with this PR the fuzz tests pass.

alamb · 2025-07-05T19:57:03Z

arrow-array/src/array/byte_view_array.rs

    /// ⇒ key("bar") < key("bar\0")
    /// ```
+    ///
+    /// ### Why the old code failed


I think it would be more helpful to future readers to explain here how the calculation works, rather than explaining why a previous attempt didn't work 🤔

Got it @alamb , thank you!

alamb · 2025-07-05T19:57:59Z

arrow-array/src/array/byte_view_array.rs

    ///
    /// ```text
    /// key("bar")   = 0x0000000000000000000062617200000003
    /// key("bar\0") = 0x0000000000000000000062617200000004


I think this diagram needs to be updated as well

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

zhuqi-lucas · 2025-07-06T06:53:25Z

Thank you @alamb for the review. I added more tests and comments in the latest change!

alamb

Thank you @zhuqi-lucas -- I also pushed a few more test cases to ease my mind

alamb · 2025-07-06T10:46:38Z

arrow-array/src/array/byte_view_array.rs

+        //
+        //   [0x77, 0x56, 0x34, 0x22]  // little-endian layout
+        //    ^     ^     ^     ^
+        //  LSB   ↑↑↑           MSB


this is a very nice comment

zhuqi-lucas · 2025-07-06T11:10:01Z

Thank you @alamb, I will also port this fix to DataFusion for the CursorValue compare after this PR is merged.

alamb · 2025-07-06T13:44:11Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_7874 (2c2df16) to aef3bdd diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=issue_7874
Results will be posted here when complete

alamb · 2025-07-06T14:03:03Z

🤖: Benchmark completed

Details

group                                                   issue_7874                             main
-----                                                   ----------                             ----
lexsort (bool, bool) 2^12                               1.00    116.1±0.28µs        ? ?/sec    1.00    116.3±0.34µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.02    159.3±0.19µs        ? ?/sec    1.00    156.4±0.39µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     44.7±0.07µs        ? ?/sec    1.01     44.9±0.24µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    210.6±0.31µs        ? ?/sec    1.01    213.7±0.21µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.4±0.04µs        ? ?/sec    1.02     39.1±0.30µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     40.9±0.06µs        ? ?/sec    1.02     41.7±0.05µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.5±0.12µs        ? ?/sec    1.00     78.8±0.12µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    209.6±0.27µs        ? ?/sec    1.00    210.3±0.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     54.8±0.13µs        ? ?/sec    1.00     55.0±0.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    255.4±0.55µs        ? ?/sec    1.01    256.8±0.56µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     90.0±0.19µs        ? ?/sec    1.01     91.1±0.15µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     91.1±0.15µs        ? ?/sec    1.01     92.1±0.15µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00    102.3±0.17µs        ? ?/sec    1.01    103.7±0.16µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    256.0±0.65µs        ? ?/sec    1.00    256.4±0.46µs        ? ?/sec
rank f32 2^12                                           1.06     72.4±0.29µs        ? ?/sec    1.00     68.1±0.39µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.9±0.08µs        ? ?/sec    1.00     35.9±0.03µs        ? ?/sec
rank string[10] 2^12                                    1.00    251.3±0.29µs        ? ?/sec    1.03    259.2±0.33µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    121.5±0.20µs        ? ?/sec    1.01    122.8±0.28µs        ? ?/sec
sort f32 2^12                                           1.00     60.4±0.54µs        ? ?/sec    1.00     60.5±0.72µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     30.1±0.10µs        ? ?/sec    1.00     30.0±0.16µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     69.0±0.12µs        ? ?/sec    1.06     73.0±0.15µs        ? ?/sec
sort f32 to indices 2^12                                1.00     71.8±0.23µs        ? ?/sec    1.06     76.0±0.19µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.02µs        ? ?/sec    1.00      7.3±0.01µs        ? ?/sec
sort i32 2^12                                           1.00     35.7±0.11µs        ? ?/sec    1.01     36.0±0.10µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.5±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     19.0±0.03µs        ? ?/sec    1.01     19.2±0.06µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.05     10.5±0.04µs        ? ?/sec    1.00     10.0±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.06     56.6±0.16µs        ? ?/sec    1.00     53.3±0.13µs        ? ?/sec
sort i32 to indices 2^10                                1.12     12.8±0.02µs        ? ?/sec    1.00     11.5±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.0±0.20µs        ? ?/sec    1.00     55.0±0.14µs        ? ?/sec
sort primitive run 2^12                                 1.02      6.5±0.01µs        ? ?/sec    1.00      6.4±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.01µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    177.0±0.34µs        ? ?/sec    1.00    177.8±0.34µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    332.5±0.78µs        ? ?/sec    1.00    332.8±1.03µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    147.0±0.25µs        ? ?/sec    1.02    149.4±0.24µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    260.0±0.38µs        ? ?/sec    1.00    260.5±0.59µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    156.9±0.25µs        ? ?/sec    1.00    157.7±0.28µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.01    284.7±0.91µs        ? ?/sec    1.00    281.8±0.83µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    147.4±0.31µs        ? ?/sec    1.00    147.4±0.37µs        ? ?/sec
sort string[1000] to indices 2^12                       1.01    251.9±1.17µs        ? ?/sec    1.00    249.4±1.39µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    142.8±0.28µs        ? ?/sec    1.02    146.1±0.57µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.5±0.43µs        ? ?/sec    1.00    248.1±0.54µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    180.1±0.55µs        ? ?/sec    1.01    181.7±0.28µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    320.3±0.66µs        ? ?/sec    1.00    321.0±0.67µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    144.6±0.22µs        ? ?/sec    1.01    146.0±0.28µs        ? ?/sec
sort string[10] to indices 2^12                         1.02    248.8±0.67µs        ? ?/sec    1.00    243.9±0.36µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.04     89.2±0.16µs        ? ?/sec    1.00     85.4±0.13µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.10    133.6±0.21µs        ? ?/sec    1.00    121.6±0.22µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.05     73.7±0.38µs        ? ?/sec    1.00     70.5±0.55µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.46µs        ? ?/sec    1.00    104.4±0.25µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.06     70.7±0.28µs        ? ?/sec    1.00     66.6±0.26µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.5±0.25µs        ? ?/sec    1.01     95.0±1.10µs        ? ?/sec

zhuqi-lucas · 2025-07-06T14:10:20Z

Thank you @alamb for the benchmark. I think the results are OK: there is only a regression of less than 10% in some cases:

sort string_view[0-400] nulls to indices 2^12           1.04     89.2±0.16µs        ? ?/sec    1.00     85.4±0.13µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.10    133.6±0.21µs        ? ?/sec    1.00    121.6±0.22µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.05     73.7±0.38µs        ? ?/sec    1.00     70.5±0.55µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.46µs        ? ?/sec    1.00    104.4±0.25µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.06     70.7±0.28µs        ? ?/sec    1.00     66.6±0.26µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.5±0.25µs        ? ?/sec    1.01     95.0±1.10µs        ? ?/sec

This is because the following two PRs had already made the sort on main 2x to 5x faster overall:
#7792
#7856

alamb · 2025-07-07T10:18:45Z

I agree -- thanks again @zhuqi-lucas

zhuqi-lucas · 2025-07-07T10:38:08Z

Thank you @alamb, I submitted the port to DataFusion:

apache/datafusion#16698

fix: Incorrect inlined string view comparison after Add prefix compar…

3dff4c1

…e for inlined

github-actions bot added the arrow Changes to the arrow crate label Jul 5, 2025

zhuqi-lucas added 2 commits July 5, 2025 23:16

Add fix pr

f5fb49a

fix doc

02c1e95

zhuqi-lucas added 2 commits July 6, 2025 00:11

performance optimization

3884cbb

polish

1abc59c

alamb mentioned this pull request Jul 5, 2025

Perf: fast CursorValues compare for StringViewArray using inline_key_… apache/datafusion#16630

Merged

alamb reviewed Jul 5, 2025

View reviewed changes

zhuqi-lucas and others added 2 commits July 6, 2025 11:53

Update arrow-array/src/array/byte_view_array.rs

35b0f4b

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Add rich tests and comments

8cf8949

zhuqi-lucas and others added 4 commits July 6, 2025 14:55

fmt

d3966b1

fuzz testing

3d2765a

Merge remote-tracking branch 'upstream/main' into issue_7874

07de90f

Add additional OCD tests

2c2df16

alamb approved these changes Jul 6, 2025

View reviewed changes

alamb merged commit df837a4 into apache:main Jul 7, 2025
26 checks passed

zhuqi-lucas mentioned this pull request Jul 7, 2025

fix: port arrow inline fast key fix to datafusion apache/datafusion#16698

Merged
