20278: perf: Optimize lpad, rpad for ASCII strings #242

Open
martin-augment wants to merge 1 commit into main from pr-20278-2026-02-17-09-56-29

@martin-augment
Owner

20278: To review by AI

The previous implementation incurred the overhead of Unicode machinery,
even for the common case that both the input string and the fill string
consist only of ASCII characters. For the ASCII-only case, we can
assume that the length in bytes equals the length in characters, and
avoid expensive grapheme-based segmentation. This follows similar
optimizations applied elsewhere in the codebase.

Benchmarks indicate this is a significant performance win for ASCII-only
input (4x-10x faster) but only a mild regression for Unicode input (2-5%
slower).

Along the way:

* Combine a few instances of `write_str(str)? + append_value("")` into
  `append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
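
The idea behind the fast path can be sketched in plain Rust without Arrow. This is a minimal illustration, not the code in this PR: `lpad_spaces` is a hypothetical helper, it builds a `String` instead of writing into an Arrow builder, and its non-ASCII fallback counts `char`s where the real implementation segments graphemes.

```rust
/// Left-pad `s` with spaces to `n` characters, truncating on the right
/// when `s` is already longer. Hypothetical helper for illustration only.
fn lpad_spaces(s: &str, n: usize) -> String {
    if s.is_ascii() {
        // ASCII: one byte per character, so byte offsets are character offsets
        // and slicing by `n` cannot split a character.
        if n <= s.len() {
            return s[..n].to_string();
        }
        let mut out = String::with_capacity(n);
        out.extend(std::iter::repeat(' ').take(n - s.len()));
        out.push_str(s);
        out
    } else {
        // Slow path (assumption: `char` counting stands in for grapheme
        // segmentation, which needs an external crate).
        let count = s.chars().count();
        if n <= count {
            return s.chars().take(n).collect();
        }
        let mut out = String::with_capacity(s.len() + (n - count));
        out.extend(std::iter::repeat(' ').take(n - count));
        out.push_str(s);
        out
    }
}
```

The ASCII branch never iterates over characters; it only compares and slices by byte length, which is where the speedup would come from.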
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the lpad and rpad string manipulation functions by implementing an optimized code path for ASCII strings. This optimization avoids the overhead of Unicode grapheme segmentation when not necessary, leading to faster execution for common ASCII inputs. The changes also include expanded benchmark coverage to validate performance with Unicode data, and improved documentation to clearly define the truncation behavior of these functions.

Highlights

  • Performance Optimization for ASCII Strings: Introduced a fast path for lpad and rpad functions when processing ASCII strings, bypassing expensive Unicode grapheme segmentation for improved performance.
  • Enhanced Benchmarking: Added new benchmark cases for lpad and rpad using Unicode strings to ensure performance and correctness across different character sets and string types (Utf8 and StringView).
  • Documentation Clarification: Updated the documentation for lpad and rpad to explicitly state that input strings are truncated if the target length n is shorter than the original string.
  • New Test Cases: Included additional test cases to verify the truncation behavior of lpad and rpad functions, particularly for ASCII strings.
Changelog
  • datafusion/functions/benches/pad.rs
    • Added GenericStringBuilder and StringViewBuilder imports for new array types.
    • Introduced UNICODE_STRINGS constant and helper functions (create_unicode_string_array, create_unicode_string_view_array) to generate Unicode test data.
    • Implemented create_unicode_pad_args to prepare arguments for Unicode pad benchmarks.
    • Integrated new benchmark groups for lpad and rpad with Unicode strings, testing both Utf8 and StringView types.
  • datafusion/functions/src/unicode/lpad.rs
    • Updated the n argument description in the function documentation to clarify truncation behavior.
    • Implemented an ASCII fast path within the lpad_impl function for GenericStringArray and StringViewArray to optimize performance.
    • Added a new test case for lpad to verify truncation of ASCII strings.
  • datafusion/functions/src/unicode/rpad.rs
    • Updated the n argument description in the function documentation to clarify truncation behavior.
    • Clarified the doc comment for rpad_impl regarding truncation.
    • Implemented an ASCII fast path within the rpad_impl function for GenericStringArray and StringViewArray to optimize performance.
    • Added a new test case for rpad to verify truncation of ASCII strings.
  • docs/source/user-guide/sql/scalar_functions.md
    • Updated the description for the n argument in the lpad function to explicitly mention truncation.
    • Updated the description for the n argument in the rpad function to explicitly mention truncation.
Activity
  • The pull request description explicitly states "To review by AI", indicating that no human review or comments have occurred yet.

@coderabbitai

coderabbitai bot commented Feb 17, 2026

Walkthrough

This pull request enhances the lpad and rpad string padding functions in DataFusion. It introduces ASCII fast-path optimizations in both functions to bypass grapheme segmentation when inputs are ASCII-compatible, improving performance for common cases. Documentation for both functions is updated to clarify that strings exceeding the target length are truncated on the right. Benchmarking infrastructure is extended with Unicode-aware test helpers and scenarios to measure performance across ASCII and Unicode string inputs in multiple configurations. The public API remains unchanged, with modifications limited to internal implementation and documentation.

@claude

claude bot commented Feb 17, 2026

Code Review: Optimize lpad, rpad for ASCII strings

Overall this is a well-motivated optimization. Skipping Unicode grapheme segmentation for ASCII-only inputs is a sound approach, and the latent write_str/append_value bug fixes are valuable. I have some specific concerns below.


Bugs / Correctness Issues

1. rpad with fill: remainder == 0 pattern is inconsistent with lpad

In rpad.rs, the ASCII fast path with a fill string ends with:

builder.write_str(string)?;
for _ in 0..full_reps {
    builder.write_str(fill)?;
}
builder.append_value(&fill[..remainder]);   // remainder may be 0

When pad_len is an exact multiple of fill_len, remainder == 0 and &fill[..0] is "". Calling append_value("") finalizes the value correctly, but it is surprising and fragile. The lpad ASCII fast path correctly guards with if remainder > 0 { builder.write_str(...)?; } before builder.append_value(string). The rpad path should follow the same pattern:

builder.write_str(string)?;
for _ in 0..full_reps {
    builder.write_str(fill)?;
}
if remainder > 0 {
    builder.write_str(&fill[..remainder])?;
}
builder.append_value("");
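
The `full_reps` / `remainder` arithmetic discussed here can be checked in isolation with a std-only sketch. `rpad_ascii` below is a hypothetical stand-in that returns a `String` rather than writing into an Arrow builder. Returning `None` for non-ASCII input and leaving the string unchanged for an empty fill are assumptions made for this example, not behavior taken from the PR.

```rust
/// Right-pad an ASCII `s` to `n` bytes by cycling through an ASCII `fill`.
/// Returns `None` when either input is not ASCII (the caller would then
/// take the grapheme-aware path). Hypothetical helper for illustration.
fn rpad_ascii(s: &str, n: usize, fill: &str) -> Option<String> {
    if !(s.is_ascii() && fill.is_ascii()) {
        return None;
    }
    if n <= s.len() {
        // Truncate on the right; safe because every byte is one character.
        return Some(s[..n].to_string());
    }
    if fill.is_empty() {
        // Assumption: nothing to pad with, so return the input unchanged
        // (this also avoids dividing by zero below).
        return Some(s.to_string());
    }
    let pad_len = n - s.len();
    let full_reps = pad_len / fill.len();
    let remainder = pad_len % fill.len();

    let mut out = String::with_capacity(n);
    out.push_str(s);
    for _ in 0..full_reps {
        out.push_str(fill);
    }
    // Guarded the same way as the lpad path: skip the partial fill when the
    // padding is an exact multiple of the fill length.
    if remainder > 0 {
        out.push_str(&fill[..remainder]);
    }
    Some(out)
}
```

For example, `rpad_ascii("hi", 6, "xy")` takes the `remainder == 0` branch, while `rpad_ascii("hi", 5, "xy")` appends one partial fill.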

2. String::repeat allocates in the hot loop (no-fill paths)

Both lpad and rpad no-fill ASCII fast paths still do:

builder.write_str(" ".repeat(length - str_len).as_str())?;

String::repeat allocates. Since the existing Unicode branch does the same thing this is not a regression, but since this is supposed to be the fast path, consider building a reusable spaces buffer once outside the row loop (or writing spaces in a small fixed-size chunk loop) to avoid per-row allocation.
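
One way to do the "fixed-size chunk loop" is sketched below. It assumes only that the output sink implements `std::fmt::Write`, which `String` does. The builders in this PR are driven through `write_str`, so they presumably fit as well, but that is an assumption here. `SPACES` and `write_spaces` are hypothetical names.

```rust
use std::fmt::Write;

/// A fixed run of spaces that is sliced instead of allocating per row.
const SPACES: &str = "                                ";

/// Write `n` spaces into `out` without allocating a temporary `String`.
fn write_spaces<W: Write>(out: &mut W, mut n: usize) -> std::fmt::Result {
    while n > 0 {
        // Emit at most one chunk of the static buffer per iteration.
        let chunk = n.min(SPACES.len());
        out.write_str(&SPACES[..chunk])?;
        n -= chunk;
    }
    Ok(())
}
```

Compared with `" ".repeat(k)`, this trades one heap allocation per row for a handful of `write_str` calls, and only pays more than one call when the padding exceeds the chunk size.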


Logic / Edge Cases

3. Doc inconsistency between lpad and rpad on truncation description

lpad doc: "it is truncated (on the right)."
rpad doc: "it is truncated."

Both truncate to the leftmost n characters. Either add the same parenthetical to both, or remove it from both. Suggested neutral wording: "the string is truncated to n characters".

4. length == str_len does unnecessary arithmetic

When length == str_len, the padding branch is reached, pad_len = 0, full_reps = 0, remainder = 0, and append_value(string) is called — correct but avoidable. An early exit before the branches would be clearer:

if length == str_len {
    builder.append_value(string);
    continue;
}

Test Coverage

5. No tests covering remainder == 0 in the ASCII fill path

The new tests cover truncation but not the full_reps/remainder arithmetic. Adding:

  • lpad("hi", 6, "xy") → "xyxyhi" (exact multiple, remainder == 0)
  • rpad("hi", 6, "xy") → "hixyxy" (exact multiple, remainder == 0)

would catch the fragile append_value(&fill[..0]) pattern in issue #1 if it were ever broken.
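
The expected values in these two cases can be derived from a small reference model. The sketch below is std-only and deliberately simple: it allocates freely and is meant as an oracle for tests, not as a fast path. `ascii_padding`, `lpad_model`, and `rpad_model` are hypothetical names, and all inputs are assumed to be ASCII.

```rust
/// Build `pad_len` bytes of padding by cycling through an ASCII `fill`.
fn ascii_padding(fill: &str, pad_len: usize) -> String {
    debug_assert!(fill.is_ascii());
    if fill.is_empty() {
        return String::new();
    }
    let full_reps = pad_len / fill.len();
    let remainder = pad_len % fill.len();
    let mut pad = fill.repeat(full_reps);
    // `&fill[..0]` is the empty string, so an exact multiple adds nothing.
    pad.push_str(&fill[..remainder]);
    pad
}

/// Reference lpad for ASCII input: padding first, then the string.
fn lpad_model(s: &str, n: usize, fill: &str) -> String {
    debug_assert!(s.is_ascii());
    if n <= s.len() {
        return s[..n].to_string();
    }
    let mut out = ascii_padding(fill, n - s.len());
    out.push_str(s);
    out
}

/// Reference rpad for ASCII input: the string first, then the padding.
fn rpad_model(s: &str, n: usize, fill: &str) -> String {
    debug_assert!(s.is_ascii());
    if n <= s.len() {
        return s[..n].to_string();
    }
    let mut out = s.to_string();
    out.push_str(&ascii_padding(fill, n - s.len()));
    out
}
```

With a target length of 6 and fill `"xy"`, the padding is exactly two repetitions; with 5 it is one repetition plus a one-byte prefix of the fill.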

6. No test for mixed ASCII/Unicode input (fast path should NOT activate)

A test like lpad("héllo", 7, "ab") verifies that when either string or fill is non-ASCII, the Unicode grapheme path is still used. Without this, a future refactor could accidentally widen the ASCII fast-path condition.
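
A sketch of why the guard has to cover both arguments: for UTF-8 text, byte length equals character count only when every character is ASCII, so a single non-ASCII character in either the string or the fill invalidates the byte-offset arithmetic. The helper names below are hypothetical.

```rust
/// The fast path is only sound when both inputs are pure ASCII.
fn can_use_ascii_fast_path(string: &str, fill: &str) -> bool {
    string.is_ascii() && fill.is_ascii()
}

/// True when slicing by byte offset is the same as slicing by character.
fn byte_len_matches_char_len(s: &str) -> bool {
    s.len() == s.chars().count()
}
```

For "héllo" (with a precomposed é) the byte length is 6 while the character count is 5, so padding computed from `len()` would come out one short.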


Summary

| Priority | Issue |
|---|---|
| Medium | #1: rpad fill remainder=0 is inconsistent with the lpad pattern; fragile |
| Low | #2: String::repeat allocates in the no-fill fast path |
| Low | #3: Doc inconsistency between lpad and rpad truncation description |
| Low | #4: length == str_len does unnecessary arithmetic |
| Suggestion | #5/#6: Add tests for remainder==0 and mixed ASCII/Unicode inputs |

The core optimization is correct and the latent write_str/append_value bug fixes are solid. Addressing the inconsistency in issue #1 and adding the suggested tests would make this production-ready.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
datafusion/functions/src/unicode/lpad.rs (1)

295-316: Allocation on every padded row: " ".repeat(...) creates a heap String.

In the ASCII fast path (and the Unicode fallback), " ".repeat(length - str_len) allocates a new String per row. For a space-only fill, you could avoid this by writing spaces in a loop or pre-allocating a reusable buffer, similar to how the fill-string path avoids per-row allocation.

This is a minor optimization opportunity and not a blocker — the fast path is already a significant improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/unicode/lpad.rs` around lines 295 - 316, The code
currently allocates a new String per row with " ".repeat(...) in both the ASCII
fast path (inside the string.is_ascii() branch) and the Unicode fallback; avoid
per-row heap allocations by writing spaces from a reusable buffer or loop
instead. Modify the lpad logic to use a preallocated/mutable space buffer (e.g.,
a spaces_buf in the outer scope) that is grown once to the max needed length and
then call builder.write_str(&spaces_buf[..n]) or emit n spaces with a short
loop, replacing the two uses of " ".repeat(...); update the branches around
string.is_ascii(), builder.write_str, graphemes_buf, and append_value to use
this reusable buffer so no new String is created per row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/source/user-guide/sql/scalar_functions.md`:
- Line 1823: Update the rpad documentation entry so its truncation direction
matches lpad by appending "(on the right)" to the sentence describing
truncation; locate the rpad description (symbol: rpad) and mirror the phrasing
used in the lpad documentation (symbol: lpad) to read "If the input string is
longer than this length, it is truncated (on the right)."

---

Nitpick comments:
In `@datafusion/functions/src/unicode/lpad.rs`:
- Around line 295-316: The code currently allocates a new String per row with "
".repeat(...) in both the ASCII fast path (inside the string.is_ascii() branch)
and the Unicode fallback; avoid per-row heap allocations by writing spaces from
a reusable buffer or loop instead. Modify the lpad logic to use a
preallocated/mutable space buffer (e.g., a spaces_buf in the outer scope) that
is grown once to the max needed length and then call
builder.write_str(&spaces_buf[..n]) or emit n spaces with a short loop,
replacing the two uses of " ".repeat(...); update the branches around
string.is_ascii(), builder.write_str, graphemes_buf, and append_value to use
this reusable buffer so no new String is created per row.


  - **str**: String expression to operate on. Can be a constant, column, or function, and any combination of operators.
- - **n**: String length to pad to.
+ - **n**: String length to pad to. If the input string is longer than this length, it is truncated.

⚠️ Potential issue | 🟡 Minor

Minor inconsistency: rpad truncation doc omits direction, unlike lpad.

The lpad doc at Line 1595 says "it is truncated (on the right)" but the rpad doc here says only "it is truncated." For consistency and clarity, consider adding "(on the right)" here as well, matching the lpad wording.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/user-guide/sql/scalar_functions.md` at line 1823, Update the rpad
documentation entry so its truncation direction matches lpad by appending "(on
the right)" to the sentence describing truncation; locate the rpad description
(symbol: rpad) and mirror the phrasing used in the lpad documentation (symbol:
lpad) to read "If the input string is longer than this length, it is truncated
(on the right)."

Owner Author

value:good-to-have; category:documentation; feedback: The CodeRabbit AI reviewer is correct! Stating which side is truncated would be helpful. It will save the developer from having to guess.

@augmentcode

augmentcode bot commented Feb 17, 2026

🤖 Augment PR Summary

Summary: Improves performance of the Unicode lpad/rpad scalar functions by adding an ASCII-only fast path.

Changes:

  • Add ASCII short-circuit logic in `lpad` and `rpad` to avoid grapheme segmentation when both input and padding are ASCII
  • Clarify documentation that when the requested length is shorter than the input, the result is truncated (keeping the left side)
  • Add regression tests covering truncation behavior for short target lengths
  • Extend pad benchmarks to include Unicode string inputs for both Utf8 and Utf8View arrays
Technical Notes: The fast path relies on the invariant that ASCII byte length equals character length; non-ASCII inputs continue using grapheme-aware logic.

@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces performance optimizations for lpad and rpad functions by adding a fast path for ASCII strings, including new benchmarks for Unicode strings and updated documentation. No security vulnerabilities were found. However, critical issues were identified related to unhandled Result return types in the new implementation, which could lead to silent failures. Suggestions have been provided to fix these. Additionally, there are opportunities to further improve performance by avoiding string allocations and to reduce code duplication in the benchmark helpers.

Comment on lines +247 to +250
builder.write_str(string)?;
builder.append_value(
" ".repeat(length - str_len).as_str(),
);

critical

The Result from builder.append_value is not handled, which can hide errors. This is a critical bug. You should propagate the error using the ? operator.

Additionally, using " ".repeat(...) can be inefficient as it allocates a new String for padding. Consider writing spaces character by character in a loop to avoid this allocation.

                                    builder.write_str(string)?;
                                    for _ in 0..(length - str_len) {
                                        builder.write_char(' ')?;
                                    }
                                    builder.append_value("")?;

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The String allocation can be avoided by appending the space characters one at a time. This would prevent a memory allocation in a potentially hot path.

Comment on lines 261 to +264
  builder.write_str(string)?;
- builder.write_str(
+ builder.append_value(
      &" ".repeat(length - graphemes_buf.len()),
- )?;
- builder.append_value("");
+ );

critical

The Result from builder.append_value is not handled, which is a critical bug. You should use the ? operator to propagate potential errors.

Also, " ".repeat(...) allocates a new String, which is inefficient. It's better to write the padding spaces in a loop.

                                    builder.write_str(string)?;
                                    for _ in 0..(length - graphemes_buf.len()) {
                                        builder.write_char(' ')?;
                                    }
                                    builder.append_value("")?;

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The String allocation can be avoided by appending the space characters one at a time. This would prevent a memory allocation in a potentially hot path.

for _ in 0..full_reps {
builder.write_str(fill)?;
}
builder.append_value(&fill[..remainder]);

critical

The Result from builder.append_value is not handled. This can hide errors and should be propagated using the ? operator.

                                        builder.append_value(&fill[..remainder])?;

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several places where an append_xyz() method is called without checking it for an Err result. This may hide a problem and continue using the builder without some earlier data.

.unwrap();
builder.write_char(c)?;
}
builder.append_value("");

critical

The Result from builder.append_value is not handled. This can hide errors and should be propagated using the ? operator.

                                        builder.append_value("")?;

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several places where an append_xyz() method is called without checking it for an Err result. This may hide a problem and continue using the builder without some earlier data.

Comment on lines +49 to +79
fn create_unicode_string_array<O: OffsetSizeTrait>(
size: usize,
null_density: f32,
) -> arrow::array::GenericStringArray<O> {
let mut rng = rand::rng();
let mut builder = GenericStringBuilder::<O>::new();
for i in 0..size {
if rng.random::<f32>() < null_density {
builder.append_null();
} else {
builder.append_value(UNICODE_STRINGS[i % UNICODE_STRINGS.len()]);
}
}
builder.finish()
}

fn create_unicode_string_view_array(
size: usize,
null_density: f32,
) -> arrow::array::StringViewArray {
let mut rng = rand::rng();
let mut builder = StringViewBuilder::with_capacity(size);
for i in 0..size {
if rng.random::<f32>() < null_density {
builder.append_null();
} else {
builder.append_value(UNICODE_STRINGS[i % UNICODE_STRINGS.len()]);
}
}
builder.finish()
}

medium

There's significant code duplication between create_unicode_string_array and create_unicode_string_view_array. The logic inside both functions is nearly identical. Consider refactoring this to a single generic function or a macro to reduce duplication and improve maintainability.
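
One possible shape for the shared helper is sketched below without Arrow. The `StringSink` trait is a hypothetical abstraction over the two builders; it is not an Arrow trait, and the benchmark would need thin adapters for `GenericStringBuilder` and `StringViewBuilder` to use it. Null placement is passed in as a predicate instead of being drawn from `rand`, which is an assumption made to keep the example deterministic.

```rust
/// Hypothetical abstraction over "something that collects optional strings".
trait StringSink {
    type Output;
    fn push(&mut self, value: Option<&str>);
    fn finish(self) -> Self::Output;
}

/// Fill `sink` with `size` rows, cycling through `values` and inserting a
/// null wherever `is_null(row)` is true. `values` must not be empty.
fn fill_cycled<S: StringSink>(
    mut sink: S,
    size: usize,
    values: &[&str],
    is_null: impl Fn(usize) -> bool,
) -> S::Output {
    for i in 0..size {
        if is_null(i) {
            sink.push(None);
        } else {
            sink.push(Some(values[i % values.len()]));
        }
    }
    sink.finish()
}

/// Stand-in sink used here in place of an Arrow builder.
struct VecSink(Vec<Option<String>>);

impl StringSink for VecSink {
    type Output = Vec<Option<String>>;
    fn push(&mut self, value: Option<&str>) {
        self.0.push(value.map(str::to_string));
    }
    fn finish(self) -> Self::Output {
        self.0
    }
}
```

With this shape the loop body exists once, and each array type contributes only its `push` and `finish` glue. A small macro would be another way to get the same effect.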

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The two functions are almost identical and they could be merged into a single more generic one and reused.

Comment on lines +127 to +143
if use_string_view {
let string_array = create_unicode_string_view_array(size, 0.1);
let fill_array = create_unicode_string_view_array(size, 0.1);
vec![
ColumnarValue::Array(Arc::new(string_array)),
ColumnarValue::Array(length_array),
ColumnarValue::Array(Arc::new(fill_array)),
]
} else {
let string_array = create_unicode_string_array::<i32>(size, 0.1);
let fill_array = create_unicode_string_array::<i32>(size, 0.1);
vec![
ColumnarValue::Array(Arc::new(string_array)),
ColumnarValue::Array(length_array),
ColumnarValue::Array(Arc::new(fill_array)),
]
}

medium

The if/else blocks contain duplicated logic for constructing the Vec<ColumnarValue>. You can refactor this to reduce code duplication by creating the string and fill arrays within the branches and then constructing the vector once.

Suggested change
let (string_array, fill_array): (arrow::array::ArrayRef, arrow::array::ArrayRef) = if use_string_view {
(
std::sync::Arc::new(create_unicode_string_view_array(size, 0.1)),
std::sync::Arc::new(create_unicode_string_view_array(size, 0.1)),
)
} else {
(
std::sync::Arc::new(create_unicode_string_array::<i32>(size, 0.1)),
std::sync::Arc::new(create_unicode_string_array::<i32>(size, 0.1)),
)
};
vec![
ColumnarValue::Array(string_array),
ColumnarValue::Array(length_array),
ColumnarValue::Array(fill_array),
]

Owner Author

@martin-augment martin-augment Feb 17, 2026

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The logic for constructing the ColumnarValues could be simplified. It would be both easier to maintain and more performant.


  - **str**: String expression to operate on. Can be a constant, column, or function, and any combination of operators.
- - **n**: String length to pad to.
+ - **n**: String length to pad to. If the input string is longer than this length, it is truncated.

medium

For consistency with lpad's documentation and to be more precise, it would be good to specify that truncation happens on the right.

Suggested change
- **n**: String length to pad to. If the input string is longer than this length, it is truncated (on the right).

Owner Author

value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! Stating which side is truncated would be helpful. It will save the developer from having to guess.

@martin-augment
Owner Author

3. Doc inconsistency between lpad and rpad on truncation description

lpad doc: "it is truncated (on the right)." rpad doc: "it is truncated."

Both truncate to the leftmost n characters. Either add the same parenthetical to both, or remove it from both. Suggested neutral wording: "the string is truncated to n characters".

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! Stating which side is truncated would be helpful. It will save the developer from having to guess.

@martin-augment
Owner Author

295-316: Allocation on every padded row: " ".repeat(...) creates a heap String.

In the ASCII fast path (and the Unicode fallback), " ".repeat(length - str_len) allocates a new String per row. For a space-only fill, you could avoid this by writing spaces in a loop or pre-allocating a reusable buffer, similar to how the fill-string path avoids per-row allocation.

This is a minor optimization opportunity and not a blocker — the fast path is already a significant improvement.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The String allocation can be avoided by appending the space characters one at a time. This would prevent a memory allocation in a potentially hot path.
