20278: perf: Optimize lpad, rpad for ASCII strings#242
20278: perf: Optimize lpad, rpad for ASCII strings#242martin-augment wants to merge 1 commit intomainfrom
Conversation
The previous implementation incurred the overhead of Unicode machinery,
even for the common case that both the input string and the fill string
consistent only of ASCII characters. For the ASCII-only case, we can
assume that the length in bytes equals the length in characters, and
avoid expensive graphene-based segmentation. This follows similar
optimizations applied elsewhere in the codebase.
Benchmarks indicate this is a significant performance win for ASCII-only
input (4x-10x faster) but only a mild regression for Unicode input (2-5%
slower).
Along the way:
* Combine: a few instances of `write_str(str)? + append_value("")` with
`append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the performance of the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis pull request enhances the lpad and rpad string padding functions in DataFusion. It introduces ASCII fast-path optimizations in both functions to bypass grapheme segmentation when inputs are ASCII-compatible, improving performance for common cases. Documentation for both functions is updated to clarify that strings exceeding the target length are truncated on the right. Benchmarking infrastructure is extended with Unicode-aware test helpers and scenarios to measure performance across ASCII and Unicode string inputs in multiple configurations. The public API remains unchanged, with modifications limited to internal implementation and documentation. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: Optimize lpad, rpad for ASCII stringsOverall this is a well-motivated optimization. Skipping Unicode grapheme segmentation for ASCII-only inputs is a sound approach, and the latent Bugs / Correctness Issues1.
|
| Priority | Issue |
|---|---|
| Medium | #1: rpad fill remainder=0 — inconsistent with lpad pattern; fragile |
| Low | #2: String::repeat allocates in the no-fill fast path |
| Low | #3: Doc inconsistency between lpad and rpad truncation description |
| Low | #4: length == str_len does unnecessary arithmetic |
| Suggestion | #5/#6: Add tests for remainder==0 and mixed ASCII/Unicode inputs |
The core optimization is correct and the latent write_str/append_value bug fixes are solid. Addressing the inconsistency in issue #1 and adding the suggested tests would make this production-ready.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
datafusion/functions/src/unicode/lpad.rs (1)
295-316: Allocation on every padded row:" ".repeat(...)creates a heapString.In the ASCII fast path (and the Unicode fallback),
" ".repeat(length - str_len)allocates a newStringper row. For a space-only fill, you could avoid this by writing spaces in a loop or pre-allocating a reusable buffer, similar to how the fill-string path avoids per-row allocation.This is a minor optimization opportunity and not a blocker — the fast path is already a significant improvement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions/src/unicode/lpad.rs` around lines 295 - 316, The code currently allocates a new String per row with " ".repeat(...) in both the ASCII fast path (inside the string.is_ascii() branch) and the Unicode fallback; avoid per-row heap allocations by writing spaces from a reusable buffer or loop instead. Modify the lpad logic to use a preallocated/mutable space buffer (e.g., a spaces_buf in the outer scope) that is grown once to the max needed length and then call builder.write_str(&spaces_buf[..n]) or emit n spaces with a short loop, replacing the two uses of " ".repeat(...); update the branches around string.is_ascii(), builder.write_str, graphemes_buf, and append_value to use this reusable buffer so no new String is created per row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/user-guide/sql/scalar_functions.md`:
- Line 1823: Update the rpad documentation entry so its truncation direction
matches lpad by appending "(on the right)" to the sentence describing
truncation; locate the rpad description (symbol: rpad) and mirror the phrasing
used in the lpad documentation (symbol: lpad) to read "If the input string is
longer than this length, it is truncated (on the right)."
---
Nitpick comments:
In `@datafusion/functions/src/unicode/lpad.rs`:
- Around line 295-316: The code currently allocates a new String per row with "
".repeat(...) in both the ASCII fast path (inside the string.is_ascii() branch)
and the Unicode fallback; avoid per-row heap allocations by writing spaces from
a reusable buffer or loop instead. Modify the lpad logic to use a
preallocated/mutable space buffer (e.g., a spaces_buf in the outer scope) that
is grown once to the max needed length and then call
builder.write_str(&spaces_buf[..n]) or emit n spaces with a short loop,
replacing the two uses of " ".repeat(...); update the branches around
string.is_ascii(), builder.write_str, graphemes_buf, and append_value to use
this reusable buffer so no new String is created per row.
|
|
||
| - **str**: String expression to operate on. Can be a constant, column, or function, and any combination of operators. | ||
| - **n**: String length to pad to. | ||
| - **n**: String length to pad to. If the input string is longer than this length, it is truncated. |
There was a problem hiding this comment.
Minor inconsistency: rpad truncation doc omits direction, unlike lpad.
The lpad doc at Line 1595 says "it is truncated (on the right)" but the rpad doc here says only "it is truncated." For consistency and clarity, consider adding "(on the right)" here as well, matching the lpad wording.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/source/user-guide/sql/scalar_functions.md` at line 1823, Update the rpad
documentation entry so its truncation direction matches lpad by appending "(on
the right)" to the sentence describing truncation; locate the rpad description
(symbol: rpad) and mirror the phrasing used in the lpad documentation (symbol:
lpad) to read "If the input string is longer than this length, it is truncated
(on the right)."
There was a problem hiding this comment.
value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! Explaining the side of the truncation would be helpful. It will prevent the developer to guess.
🤖 Augment PR SummarySummary: Improves performance of the Unicode Changes:
🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Code Review
This pull request introduces performance optimizations for lpad and rpad functions by adding a fast path for ASCII strings, including new benchmarks for Unicode strings and updated documentation. No security vulnerabilities were found. However, critical issues were identified related to unhandled Result return types in the new implementation, which could lead to silent failures. Suggestions have been provided to fix these. Additionally, there are opportunities to further improve performance by avoiding string allocations and to reduce code duplication in the benchmark helpers.
| builder.write_str(string)?; | ||
| builder.append_value( | ||
| " ".repeat(length - str_len).as_str(), | ||
| ); |
There was a problem hiding this comment.
The Result from builder.append_value is not handled, which can hide errors. This is a critical bug. You should propagate the error using the ? operator.
Additionally, using " ".repeat(...) can be inefficient as it allocates a new String for padding. Consider writing spaces character by character in a loop to avoid this allocation.
builder.write_str(string)?;
for _ in 0..(length - str_len) {
builder.write_char(' ')?;
}
builder.append_value("")?;There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocation of the String might be avoided by appending several space characters. This would prevent memory allocation in eventually hot path.
| builder.write_str(string)?; | ||
| builder.write_str( | ||
| builder.append_value( | ||
| &" ".repeat(length - graphemes_buf.len()), | ||
| )?; | ||
| builder.append_value(""); | ||
| ); |
There was a problem hiding this comment.
The Result from builder.append_value is not handled, which is a critical bug. You should use the ? operator to propagate potential errors.
Also, " ".repeat(...) allocates a new String, which is inefficient. It's better to write the padding spaces in a loop.
builder.write_str(string)?;
for _ in 0..(length - graphemes_buf.len()) {
builder.write_char(' ')?;
}
builder.append_value("")?;There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The allocation of the String might be avoided by appending several space characters. This would prevent memory allocation in eventually hot path.
| for _ in 0..full_reps { | ||
| builder.write_str(fill)?; | ||
| } | ||
| builder.append_value(&fill[..remainder]); |
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several places where an append_xyz() method is called without checking it for an Err result. This may hide a problem and continue using the builder without some earlier data.
| .unwrap(); | ||
| builder.write_char(c)?; | ||
| } | ||
| builder.append_value(""); |
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several places where an append_xyz() method is called without checking it for an Err result. This may hide a problem and continue using the builder without some earlier data.
| fn create_unicode_string_array<O: OffsetSizeTrait>( | ||
| size: usize, | ||
| null_density: f32, | ||
| ) -> arrow::array::GenericStringArray<O> { | ||
| let mut rng = rand::rng(); | ||
| let mut builder = GenericStringBuilder::<O>::new(); | ||
| for i in 0..size { | ||
| if rng.random::<f32>() < null_density { | ||
| builder.append_null(); | ||
| } else { | ||
| builder.append_value(UNICODE_STRINGS[i % UNICODE_STRINGS.len()]); | ||
| } | ||
| } | ||
| builder.finish() | ||
| } | ||
|
|
||
| fn create_unicode_string_view_array( | ||
| size: usize, | ||
| null_density: f32, | ||
| ) -> arrow::array::StringViewArray { | ||
| let mut rng = rand::rng(); | ||
| let mut builder = StringViewBuilder::with_capacity(size); | ||
| for i in 0..size { | ||
| if rng.random::<f32>() < null_density { | ||
| builder.append_null(); | ||
| } else { | ||
| builder.append_value(UNICODE_STRINGS[i % UNICODE_STRINGS.len()]); | ||
| } | ||
| } | ||
| builder.finish() | ||
| } |
There was a problem hiding this comment.
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The two functions are almost identical and they could be merged into a single more generic one and reused.
| if use_string_view { | ||
| let string_array = create_unicode_string_view_array(size, 0.1); | ||
| let fill_array = create_unicode_string_view_array(size, 0.1); | ||
| vec![ | ||
| ColumnarValue::Array(Arc::new(string_array)), | ||
| ColumnarValue::Array(length_array), | ||
| ColumnarValue::Array(Arc::new(fill_array)), | ||
| ] | ||
| } else { | ||
| let string_array = create_unicode_string_array::<i32>(size, 0.1); | ||
| let fill_array = create_unicode_string_array::<i32>(size, 0.1); | ||
| vec![ | ||
| ColumnarValue::Array(Arc::new(string_array)), | ||
| ColumnarValue::Array(length_array), | ||
| ColumnarValue::Array(Arc::new(fill_array)), | ||
| ] | ||
| } |
There was a problem hiding this comment.
The if/else blocks contain duplicated logic for constructing the Vec<ColumnarValue>. You can refactor this to reduce code duplication by creating the string and fill arrays within the branches and then constructing the vector once.
| if use_string_view { | |
| let string_array = create_unicode_string_view_array(size, 0.1); | |
| let fill_array = create_unicode_string_view_array(size, 0.1); | |
| vec![ | |
| ColumnarValue::Array(Arc::new(string_array)), | |
| ColumnarValue::Array(length_array), | |
| ColumnarValue::Array(Arc::new(fill_array)), | |
| ] | |
| } else { | |
| let string_array = create_unicode_string_array::<i32>(size, 0.1); | |
| let fill_array = create_unicode_string_array::<i32>(size, 0.1); | |
| vec![ | |
| ColumnarValue::Array(Arc::new(string_array)), | |
| ColumnarValue::Array(length_array), | |
| ColumnarValue::Array(Arc::new(fill_array)), | |
| ] | |
| } | |
| let (string_array, fill_array): (arrow::array::ArrayRef, arrow::array::ArrayRef) = if use_string_view { | |
| ( | |
| std::sync::Arc::new(create_unicode_string_view_array(size, 0.1)), | |
| std::sync::Arc::new(create_unicode_string_view_array(size, 0.1)), | |
| ) | |
| } else { | |
| ( | |
| std::sync::Arc::new(create_unicode_string_array::<i32>(size, 0.1)), | |
| std::sync::Arc::new(create_unicode_string_array::<i32>(size, 0.1)), | |
| ) | |
| }; | |
| vec![ | |
| ColumnarValue::Array(string_array), | |
| ColumnarValue::Array(length_array), | |
| ColumnarValue::Array(fill_array), | |
| ] |
There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The logic for constructing the ColumnarValues could be simplified. it will be both easier to maintain and it will be more performant.
|
|
||
| - **str**: String expression to operate on. Can be a constant, column, or function, and any combination of operators. | ||
| - **n**: String length to pad to. | ||
| - **n**: String length to pad to. If the input string is longer than this length, it is truncated. |
There was a problem hiding this comment.
For consistency with lpad's documentation and to be more precise, it would be good to specify that truncation happens on the right.
| - **n**: String length to pad to. If the input string is longer than this length, it is truncated. | |
| - **n**: String length to pad to. If the input string is longer than this length, it is truncated (on the right). |
There was a problem hiding this comment.
value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! Explaining the side of the truncation would be helpful. It will prevent the developer to guess.
value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! Explaining the side of the truncation would be helpful. It will prevent the developer to guess. |
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The allocation of the String might be avoided by appending several space characters. This would prevent memory allocation in eventually hot path. |
20278: To review by AI