zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Jul 5, 2025

…e for inlined

Which issue does this PR close?

Rationale for this change

Change Summary

Rework inline_key_fast to avoid reversing the inline data bytes by removing the global .to_be() on the entire 128‑bit word and instead manually constructing the big‑endian key in two parts: the 96‑bit data portion and the 32‑bit length tiebreaker.


Problem

In the original implementation:

let inline_u128 = u128::from_le_bytes(raw_bytes).to_be();
  • What went wrong: Calling .to_be() on the full 16‑byte value flips all bytes, including the 12 bytes of inline data.
  • Consequences: Multi‑byte strings are compared in reversed byte order (e.g. "backend one" is compared as if it were "eno dnekcab"), so the result no longer follows the lexicographical order of the original strings.
  • Corner cases exposed:
    “backend one” vs. “backend two”: suffixes “one”/“two” compare incorrectly once reversed.

Solution

    #[inline(always)]
    pub fn inline_key_fast(raw: u128) -> u128 {
        // 1. Decompose `raw` into little‑endian bytes:
        //    - raw_bytes[0..4]  = length in LE
        //    - raw_bytes[4..16] = inline string data
        let raw_bytes = raw.to_le_bytes();

        // 2. Numerically truncate to get the low 32‑bit length (endianness‑free).
        let length = raw as u32;

        // 3. Build a 16‑byte buffer in big‑endian order:
        //    - buf[0..12]  = inline string bytes (in original order)
        //    - buf[12..16] = length.to_be_bytes() (BE)
        let mut buf = [0u8; 16];
        buf[0..12].copy_from_slice(&raw_bytes[4..16]); // inline data

        // Why convert length to big-endian for comparison?
        //
        // Most platforms that Rust targets (e.g. x86_64) store integers in
        // little-endian format, meaning the least significant byte is at the
        // lowest memory address. For example, a u32 value like 0x22345677 is
        // stored in memory as:
        //
        //   [0x77, 0x56, 0x34, 0x22]  // little-endian layout
        //    ^     ^     ^     ^
        //  LSB   ↑↑↑           MSB
        //
        // This layout is efficient for arithmetic but *not* suitable for
        // lexicographic (dictionary-style) comparison of byte arrays.
        //
        // To compare values by byte order—e.g., for sorted keys or binary trees—
        // we must convert them to **big-endian**, where:
        //
        //   - The most significant byte (MSB) comes first (index 0)
        //   - The least significant byte (LSB) comes last (index N-1)
        //
        // In big-endian, the same u32 = 0x22345677 would be represented as:
        //
        //   [0x22, 0x34, 0x56, 0x77]
        //
        // This ordering aligns with natural string/byte sorting, so calling
        // `.to_be_bytes()` allows us to construct
        // keys where standard numeric comparison (e.g., `<`, `>`) behaves
        // like lexicographic byte comparison.
        buf[12..16].copy_from_slice(&length.to_be_bytes()); // length in BE

        // 4. Deserialize the buffer as a big‑endian u128:
        //    buf[0] is MSB, buf[15] is LSB.
        // Note on endianness and layout:
        //
        // Although `buf[0]` is stored at the lowest memory address,
        // calling `u128::from_be_bytes(buf)` interprets it as the **most significant byte (MSB)**,
        // and `buf[15]` as the **least significant byte (LSB)**.
        //
        // This is the core principle of **big-endian decoding**:
        //   - Byte at index 0 maps to bits 127..120 (highest)
        //   - Byte at index 1 maps to bits 119..112
        //   - ...
        //   - Byte at index 15 maps to bits 7..0 (lowest)
        //
        // So even though memory layout goes from low to high (left to right),
        // big-endian treats the **first byte** as highest in value.
        //
        // This guarantees that comparing two `u128` keys is equivalent to lexicographically
        // comparing the original inline bytes, followed by length.
        u128::from_be_bytes(buf)
    }
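
A minimal usage sketch of the function above, written as free functions so that it compiles on its own. `make_inline_view` and `key` are hypothetical helpers for illustration, not arrow-rs APIs; they assume the layout described in the comments (a 4‑byte little‑endian length followed by 12 zero‑padded data bytes).

```rust
/// Hypothetical helper (not an arrow-rs API): packs a string of at most
/// 12 bytes into a raw view, assuming a 4-byte little-endian length
/// followed by 12 zero-padded data bytes.
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only inline (<= 12 byte) strings are supported");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Same logic as `inline_key_fast` in this PR, as a free function.
fn inline_key_fast(raw: u128) -> u128 {
    let raw_bytes = raw.to_le_bytes();
    let length = raw as u32; // low 32 bits hold the length
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]); // data, original order
    buf[12..16].copy_from_slice(&length.to_be_bytes()); // length tiebreaker
    u128::from_be_bytes(buf)
}

/// Convenience wrapper: the comparison key for a short byte string.
fn key(data: &[u8]) -> u128 {
    inline_key_fast(make_inline_view(data))
}
```

Under these assumptions, `key(b"backend one") < key(b"backend two")` and `key(b"bar") < key(b"bar\0")` both hold, matching the ordering of the underlying byte slices.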

Testing

All existing tests — including the “backend one” vs. “backend two” and "bar" vs. "bar\0" cases — now pass, confirming both lexicographical correctness and proper length‑based tiebreaking.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 5, 2025
@zhuqi-lucas
Contributor Author

cc @alamb @Dandandan

Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas -- I verified this fixes the issue I was seeing in DataFusion 🙏

Given the subtlety here, I think a few more tests are warranted -- I left some more suggestions. Let's try and make sure we are 100% sure about this one

// This pair verifies that we didn’t accidentally reverse the inline bytes:
// without our fix, “backend one” would compare as if it were
// “eno dnekcab”, so “one” might end up sorting _after_ “two”.
b"backend one", // special case: tests byte-order reversal bug
Contributor

I wonder why this wasn't caught with the xyy and xyz test case?

Contributor Author

Good question. The reason is:

The bug caused full byte reversal of the inline string bytes, meaning the entire 12-byte segment was reversed before comparison.

For strings like "xyy" and "xyz", which differ only in their last byte, reversing the bytes moves this difference to the first byte of the reversed string.

Since both strings are compared in their reversed form, the differing byte is simply compared first, and the relative order of the two strings stays the same.

Thus, even though the byte order is wrong globally (the entire string is reversed), "xyy" still compares correctly as less than "xyz" in the reversed space, so the test passes.

In other words, differences at the end of short strings don’t expose the reversal bug, because reversing the entire string simply moves the difference to the front, preserving the relative order.

The bug only becomes apparent in strings with differences in the middle or earlier bytes, like "backend one" vs "backend two", where reversing the entire inline data inverts the lexicographical order unexpectedly.

Contributor

I guess because "xyy" < "xyz" and "yyx" < "zyx"?

Contributor Author

I guess because "xyy" < "xyz" and "yyx" < "zyx"?

Right!
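
The point made in this thread can be checked with a small sketch. The function names are illustrative, and the "az" / "ba" pair mentioned afterwards is an example chosen here, not one taken from the PR's tests.

```rust
/// Returns a copy of `s` with its bytes in reverse order.
fn reversed(s: &[u8]) -> Vec<u8> {
    s.iter().rev().copied().collect()
}

/// True when comparing the reversed byte strings gives the same
/// ordering as comparing the originals.
fn order_survives_reversal(a: &[u8], b: &[u8]) -> bool {
    a.cmp(b) == reversed(a).cmp(&reversed(b))
}
```

`order_survives_reversal(b"xyy", b"xyz")` is true, because "yyx" < "zyx", which is why that test case could not detect a reversal. A pair such as "az" and "ba" behaves differently: "az" < "ba", but reversed they become "za" > "ab".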

// without our fix, “backend one” would compare as if it were
// “eno dnekcab”, so “one” might end up sorting _after_ “two”.
b"backend one", // special case: tests byte-order reversal bug
b"backend two",
Contributor

Can we also please include strings that have the exact same inline prefix but differ in length as well as content?

Something like

  • LongerThan12Bytes
  • LongerThan12Bytez
  • LongerThan12Bytes\0
  • LongerThan12Byt

Contributor

Also, given the endian swapping going on, can we please also include a few strings that are more than 256 bytes long (so the length requires 2 bytes to store)?
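
To illustrate why lengths of 256 or more are worth testing: once the length needs a second byte, comparing its little-endian bytes lexicographically no longer matches numeric order, while comparing big-endian bytes does. A small sketch (the function names are illustrative, not from arrow-rs):

```rust
use std::cmp::Ordering;

/// Compares two lengths by the lexicographic order of their
/// little-endian byte arrays (least significant byte first).
fn cmp_as_le_bytes(a: u32, b: u32) -> Ordering {
    a.to_le_bytes().cmp(&b.to_le_bytes())
}

/// Compares two lengths by the lexicographic order of their
/// big-endian byte arrays (most significant byte first).
fn cmp_as_be_bytes(a: u32, b: u32) -> Ordering {
    a.to_be_bytes().cmp(&b.to_be_bytes())
}
```

For 255 and 256 the little-endian arrays are `[0xFF, 0, 0, 0]` and `[0x00, 0x01, 0, 0]`, so `cmp_as_le_bytes(255, 256)` reports `Greater`, whereas `cmp_as_be_bytes(255, 256)` reports `Less`, in agreement with the numeric order.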

Contributor Author

Thank you @alamb, good point:

            b"than12Byt",
            b"than12Bytes",
            b"than12Bytes\0",
            b"than12Bytez",

I will add the cases above, because inline_key_fast is only used when both strings being compared are at most 12 bytes long.
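
A sketch of how the four cases above could be checked: sort them once by key and once as plain byte slices, then compare the results. The helper names are hypothetical, and the view layout (4‑byte little‑endian length, then 12 zero‑padded data bytes) is the one assumed in the PR description.

```rust
/// Hypothetical helper: packs a string of at most 12 bytes into a raw view.
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only inline (<= 12 byte) strings are supported");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Same logic as `inline_key_fast` in this PR, as a free function.
fn inline_key_fast(raw: u128) -> u128 {
    let raw_bytes = raw.to_le_bytes();
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]);
    buf[12..16].copy_from_slice(&(raw as u32).to_be_bytes());
    u128::from_be_bytes(buf)
}

/// Sorts the inputs by their inline key.
fn sort_by_inline_key(inputs: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut v: Vec<&[u8]> = inputs.to_vec();
    v.sort_by_key(|s| inline_key_fast(make_inline_view(s)));
    v.into_iter().map(|s| s.to_vec()).collect()
}

/// Sorts the inputs by plain lexicographic byte order (the reference).
fn sort_by_bytes(inputs: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut v: Vec<&[u8]> = inputs.to_vec();
    v.sort();
    v.into_iter().map(|s| s.to_vec()).collect()
}
```

If the key is correct, both functions return the same order for any set of inline strings, including pairs that share a prefix and differ only in length.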

Contributor Author

Added the test cases above in the latest change.

Contributor Author

Thank you @alamb. I also found a way to do it: I adjusted the fuzz test to reproduce these cases, and with this PR the fuzz test passes.

/// ⇒ key("bar") < key("bar\0")
/// ```
///
/// ### Why the old code failed
Contributor

I think it would be more helpful to future readers to explain here how the calculation works, rather than explaining why a previous attempt didn't work 🤔

Contributor Author

Got it @alamb , thank you!

///
/// ```text
/// key("bar") = 0x0000000000000000000062617200000003
/// key("bar\0") = 0x0000000000000000000062617200000004
Contributor

I think this diagram needs to be updated as well
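
For reference, a sketch that computes the values such a diagram would show under the new layout. `sketch_key` is an illustrative shortcut, not the PR's function: it builds the key directly from the string bytes, assuming 12 zero‑padded data bytes in the high 96 bits and the big‑endian length in the low 32 bits.

```rust
/// Illustrative only: builds the key directly from the string bytes,
/// assuming the layout described in this PR (12 zero-padded data bytes
/// in the high 96 bits, big-endian length in the low 32 bits).
fn sketch_key(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only inline (<= 12 byte) strings have this key");
    let mut buf = [0u8; 16];
    buf[..data.len()].copy_from_slice(data);
    buf[12..16].copy_from_slice(&(data.len() as u32).to_be_bytes());
    u128::from_be_bytes(buf)
}

/// Formats a key as `0x` followed by exactly 32 hex digits.
fn key_hex(data: &[u8]) -> String {
    format!("{:#034x}", sketch_key(data))
}
```

Under these assumptions, `key_hex(b"bar")` yields `0x62617200000000000000000000000003` and `key_hex(b"bar\0")` yields `0x62617200000000000000000000000004`: the data bytes lead and the length only breaks the tie.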

zhuqi-lucas and others added 2 commits July 6, 2025 11:53
@zhuqi-lucas
Contributor Author

Thank you @zhuqi-lucas -- I verified this fixes the issue I was seeing in DataFusion 🙏

Given the subtlety here, I think a few more tests are warranted -- I left some more suggestions. Let's try and make sure we are 100% sure about this one

Thank you @alamb for the review. I added more tests and comments in the latest change!

Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas -- I also pushed a few more test cases to ease my mind

//
//   [0x77, 0x56, 0x34, 0x22]  // little-endian layout
//    ^     ^     ^     ^
//  LSB   ↑↑↑           MSB
Contributor

this is a very nice comment
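
The layout described in that comment can be verified with standard library calls. A small sketch (the function names are illustrative only):

```rust
/// Returns the little-endian and big-endian byte layouts of `value`.
fn layouts(value: u32) -> ([u8; 4], [u8; 4]) {
    (value.to_le_bytes(), value.to_be_bytes())
}

/// Decodes `buf` as a big-endian u128 and extracts, by shifting, the
/// byte that originally sat at `index` (0..=15).
fn byte_at_index(buf: [u8; 16], index: usize) -> u8 {
    let value = u128::from_be_bytes(buf);
    // Index 0 occupies bits 127..120, index 15 occupies bits 7..0.
    (value >> (8 * (15 - index))) as u8
}
```

`layouts(0x22345677)` returns `([0x77, 0x56, 0x34, 0x22], [0x22, 0x34, 0x56, 0x77])`, and `byte_at_index` confirms that `from_be_bytes` treats the first byte of the buffer as the most significant one.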

@zhuqi-lucas
Contributor Author

Thank you @zhuqi-lucas -- I also pushed a few more test cases to ease my mind

Thank you @alamb, I will also port this fix to DataFusion for the CursorValue comparison after this PR is merged.

@alamb
Contributor

alamb commented Jul 6, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_7874 (2c2df16) to aef3bdd diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=issue_7874
Results will be posted here when complete

@alamb
Contributor

alamb commented Jul 6, 2025

🤖: Benchmark completed

group                                                   issue_7874                             main
-----                                                   ----------                             ----
lexsort (bool, bool) 2^12                               1.00    116.1±0.28µs        ? ?/sec    1.00    116.3±0.34µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.02    159.3±0.19µs        ? ?/sec    1.00    156.4±0.39µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     44.7±0.07µs        ? ?/sec    1.01     44.9±0.24µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    210.6±0.31µs        ? ?/sec    1.01    213.7±0.21µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.4±0.04µs        ? ?/sec    1.02     39.1±0.30µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.00     40.9±0.06µs        ? ?/sec    1.02     41.7±0.05µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.5±0.12µs        ? ?/sec    1.00     78.8±0.12µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    209.6±0.27µs        ? ?/sec    1.00    210.3±0.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.00     54.8±0.13µs        ? ?/sec    1.00     55.0±0.10µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.00    255.4±0.55µs        ? ?/sec    1.01    256.8±0.56µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.00     90.0±0.19µs        ? ?/sec    1.01     91.1±0.15µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.00     91.1±0.15µs        ? ?/sec    1.01     92.1±0.15µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.00    102.3±0.17µs        ? ?/sec    1.01    103.7±0.16µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.00    256.0±0.65µs        ? ?/sec    1.00    256.4±0.46µs        ? ?/sec
rank f32 2^12                                           1.06     72.4±0.29µs        ? ?/sec    1.00     68.1±0.39µs        ? ?/sec
rank f32 nulls 2^12                                     1.06     37.9±0.08µs        ? ?/sec    1.00     35.9±0.03µs        ? ?/sec
rank string[10] 2^12                                    1.00    251.3±0.29µs        ? ?/sec    1.03    259.2±0.33µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    121.5±0.20µs        ? ?/sec    1.01    122.8±0.28µs        ? ?/sec
sort f32 2^12                                           1.00     60.4±0.54µs        ? ?/sec    1.00     60.5±0.72µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     30.1±0.10µs        ? ?/sec    1.00     30.0±0.16µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     69.0±0.12µs        ? ?/sec    1.06     73.0±0.15µs        ? ?/sec
sort f32 to indices 2^12                                1.00     71.8±0.23µs        ? ?/sec    1.06     76.0±0.19µs        ? ?/sec
sort i32 2^10                                           1.00      7.3±0.02µs        ? ?/sec    1.00      7.3±0.01µs        ? ?/sec
sort i32 2^12                                           1.00     35.7±0.11µs        ? ?/sec    1.01     36.0±0.10µs        ? ?/sec
sort i32 nulls 2^10                                     1.00      4.5±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.00     19.0±0.03µs        ? ?/sec    1.01     19.2±0.06µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.05     10.5±0.04µs        ? ?/sec    1.00     10.0±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.06     56.6±0.16µs        ? ?/sec    1.00     53.3±0.13µs        ? ?/sec
sort i32 to indices 2^10                                1.12     12.8±0.02µs        ? ?/sec    1.00     11.5±0.03µs        ? ?/sec
sort i32 to indices 2^12                                1.15     63.0±0.20µs        ? ?/sec    1.00     55.0±0.14µs        ? ?/sec
sort primitive run 2^12                                 1.02      6.5±0.01µs        ? ?/sec    1.00      6.4±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.9±0.01µs        ? ?/sec    1.00      8.9±0.02µs        ? ?/sec
sort string[0-100] nulls to indices 2^12                1.00    177.0±0.34µs        ? ?/sec    1.00    177.8±0.34µs        ? ?/sec
sort string[0-100] to indices 2^12                      1.00    332.5±0.78µs        ? ?/sec    1.00    332.8±1.03µs        ? ?/sec
sort string[0-10] nulls to indices 2^12                 1.00    147.0±0.25µs        ? ?/sec    1.02    149.4±0.24µs        ? ?/sec
sort string[0-10] to indices 2^12                       1.00    260.0±0.38µs        ? ?/sec    1.00    260.5±0.59µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    156.9±0.25µs        ? ?/sec    1.00    157.7±0.28µs        ? ?/sec
sort string[0-400] to indices 2^12                      1.01    284.7±0.91µs        ? ?/sec    1.00    281.8±0.83µs        ? ?/sec
sort string[1000] nulls to indices 2^12                 1.00    147.4±0.31µs        ? ?/sec    1.00    147.4±0.37µs        ? ?/sec
sort string[1000] to indices 2^12                       1.01    251.9±1.17µs        ? ?/sec    1.00    249.4±1.39µs        ? ?/sec
sort string[100] nulls to indices 2^12                  1.00    142.8±0.28µs        ? ?/sec    1.02    146.1±0.57µs        ? ?/sec
sort string[100] to indices 2^12                        1.00    248.5±0.43µs        ? ?/sec    1.00    248.1±0.54µs        ? ?/sec
sort string[10] dict nulls to indices 2^12              1.00    180.1±0.55µs        ? ?/sec    1.01    181.7±0.28µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    320.3±0.66µs        ? ?/sec    1.00    321.0±0.67µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.00    144.6±0.22µs        ? ?/sec    1.01    146.0±0.28µs        ? ?/sec
sort string[10] to indices 2^12                         1.02    248.8±0.67µs        ? ?/sec    1.00    243.9±0.36µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.04     89.2±0.16µs        ? ?/sec    1.00     85.4±0.13µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.10    133.6±0.21µs        ? ?/sec    1.00    121.6±0.22µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.05     73.7±0.38µs        ? ?/sec    1.00     70.5±0.55µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.46µs        ? ?/sec    1.00    104.4±0.25µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.06     70.7±0.28µs        ? ?/sec    1.00     66.6±0.26µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.5±0.25µs        ? ?/sec    1.01     95.0±1.10µs        ? ?/sec

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Jul 6, 2025

Thank you @alamb for the benchmark. I think the results are acceptable: the string_view cases regress by at most about 10%:

sort string_view[0-400] nulls to indices 2^12           1.04     89.2±0.16µs        ? ?/sec    1.00     85.4±0.13µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.10    133.6±0.21µs        ? ?/sec    1.00    121.6±0.22µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.05     73.7±0.38µs        ? ?/sec    1.00     70.5±0.55µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.00    104.2±0.46µs        ? ?/sec    1.00    104.4±0.25µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.06     70.7±0.28µs        ? ?/sec    1.00     66.6±0.26µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.00     94.5±0.25µs        ? ?/sec    1.01     95.0±1.10µs        ? ?/sec

This is acceptable given that the following two PRs already made the sort roughly 2x to 5x faster on main:
#7792
#7856

@alamb
Contributor

alamb commented Jul 7, 2025

I agree -- thanks again @zhuqi-lucas

@alamb alamb merged commit df837a4 into apache:main Jul 7, 2025
26 checks passed
@zhuqi-lucas
Contributor Author

Thank you @alamb , submitted the port to datafusion:

apache/datafusion#16698

Development

Successfully merging this pull request may close these issues.

Incorrect inlined string view comparison after " Add prefix compare for inlined"

3 participants