
20278: perf: Optimize lpad, rpad for ASCII strings#239

Open
martin-augment wants to merge 1 commit into main from pr-20278-2026-02-11-12-48-20


@martin-augment
Owner

20278: To review by AI

The previous implementation incurred the overhead of Unicode machinery,
even for the common case where both the input string and the fill string
consist only of ASCII characters. For the ASCII-only case, we can
assume that the length in bytes equals the length in characters, and
avoid expensive grapheme-based segmentation. This follows similar
optimizations applied elsewhere in the codebase.
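
To make the idea above concrete, here is a minimal standalone sketch of a left-pad with an ASCII fast path. It is not the code from this PR: the name `lpad_sketch` is hypothetical, the output is a plain `String` rather than an Arrow builder, the empty-fill behavior is an assumption, and the fallback counts `char`s as a stand-in for real grapheme segmentation.

```rust
/// Left-pad `s` to `n` characters with `fill`, truncating on the right
/// when `s` is already longer than `n`. Hypothetical helper for illustration.
fn lpad_sketch(s: &str, n: usize, fill: &str) -> String {
    if fill.is_empty() {
        // Assumption: with nothing to pad with, only truncation can apply.
        return s.chars().take(n).collect();
    }
    if s.is_ascii() && fill.is_ascii() {
        // ASCII fast path: one byte per character, so byte lengths and
        // byte slicing can be used directly.
        if n <= s.len() {
            return s[..n].to_string();
        }
        let pad_len = n - s.len();
        let mut out = String::with_capacity(n);
        // Repeat the fill bytes cyclically until the padding is complete.
        out.extend(fill.bytes().cycle().take(pad_len).map(char::from));
        out.push_str(s);
        return out;
    }
    // Slow path: count `char`s (a stand-in for grapheme clusters here).
    let s_len = s.chars().count();
    if n <= s_len {
        return s.chars().take(n).collect();
    }
    let mut out: String = fill.chars().cycle().take(n - s_len).collect();
    out.push_str(s);
    out
}
```

Under these assumptions, `lpad_sketch("hi", 5, "xy")` takes the fast path, while `lpad_sketch("josé", 6, "x")` falls back to the character-counting path because `é` occupies two bytes.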

Benchmarks indicate this is a significant performance win for ASCII-only
input (4x-10x faster) but only a mild regression for Unicode input (2-5%
slower).

Along the way:

* Replace a few instances of `write_str(str)? + append_value("")` with
  `append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces performance enhancements for the lpad and rpad string functions by adding an optimized code path for ASCII strings. This change bypasses the more computationally intensive Unicode grapheme segmentation when not needed, leading to faster execution for common ASCII data. Alongside this optimization, the PR expands the benchmarking suite to include Unicode string scenarios and clarifies the truncation behavior in the official documentation for these functions.

Highlights

  • Performance Optimization for ASCII Strings: Implemented a fast path for lpad and rpad functions when processing ASCII strings, significantly improving performance by avoiding expensive grapheme segmentation.
  • Enhanced Benchmarking: Added new benchmarking utilities and test cases specifically for Unicode strings to thoroughly evaluate the performance of lpad and rpad under various conditions.
  • Documentation Clarification: Updated the documentation for lpad and rpad to explicitly state that input strings longer than the target length n will be truncated.
  • New Test Cases: Introduced new test cases for both lpad and rpad to cover the truncation behavior when the input string exceeds the specified length.
Changelog
  • datafusion/functions/benches/pad.rs
    • Updated benchmarking utilities and added new benchmarks for lpad and rpad with Unicode strings.
    • Introduced new utility functions to create Unicode string arrays for benchmarking.
    • Added new benchmark cases for lpad and rpad functions using Unicode strings with both Utf8 and StringView types.
  • datafusion/functions/src/unicode/lpad.rs
    • Implemented an optimized ASCII fast path for the lpad function, bypassing expensive grapheme segmentation for ASCII inputs.
    • Clarified the documentation for the n argument, explicitly stating that strings longer than n will be truncated.
    • Added a new test case to verify the truncation behavior of lpad.
  • datafusion/functions/src/unicode/rpad.rs
    • Implemented an optimized ASCII fast path for the rpad function, bypassing expensive grapheme segmentation for ASCII inputs.
    • Clarified the documentation for the n argument and the function description, explicitly stating that strings longer than n will be truncated.
    • Added a new test case to verify the truncation behavior of rpad.
  • docs/source/user-guide/sql/scalar_functions.md
    • Updated the lpad function documentation to clarify that input strings longer than the specified length n will be truncated.
    • Updated the rpad function documentation to clarify that input strings longer than the specified length n will be truncated.
Activity
  • The pull request description indicates that it is 'To review by AI', suggesting no prior human review activity or comments have been recorded.

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Walkthrough

This pull request adds performance optimizations to the lpad and rpad string padding functions by introducing ASCII fast paths that bypass grapheme segmentation when both input and fill strings are ASCII. The implementation retains existing grapheme-based logic for non-ASCII inputs. Documentation for the n parameter in both functions and their user guide entries is updated to clarify truncation behavior on the right when input strings exceed the specified length. New benchmarks with Unicode-aware test data and builders are added to measure performance across various Unicode scenarios for both Utf8 and Utf8View string storage formats.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
datafusion/functions/src/unicode/rpad.rs (1)

241-252: Minor allocation in the ASCII fast path: " ".repeat() creates a heap String per row.

When padding is needed, " ".repeat(length - str_len) allocates a temporary String on each iteration. Consider writing spaces directly to the builder in a loop or using a pre-allocated buffer to avoid per-row heap allocations, which partially negates the fast-path benefit.

♻️ Suggested optimization
                            } else {
                                builder.write_str(string)?;
-                               builder.write_str(
-                                   " ".repeat(length - str_len).as_str(),
-                               )?;
+                               for _ in 0..length - str_len {
+                                   builder.write_char(' ')?;
+                               }
                                builder.append_value("");
                            }
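
As a third option alongside the loop suggested above, spaces can be written in chunks sliced from a fixed buffer, which avoids both the temporary `String` and a per-character call. This is a sketch under assumptions: `write_spaces` and `SPACES` are hypothetical names, and a plain `String` stands in for the builder, relying only on the sink implementing `std::fmt::Write`.

```rust
use std::fmt::{self, Write};

/// A fixed block of spaces to slice from; its length is arbitrary.
const SPACES: &str = "                                ";

/// Write `count` spaces into any `fmt::Write` sink without allocating
/// a temporary `String`. Hypothetical helper for illustration.
fn write_spaces<W: Write>(out: &mut W, mut count: usize) -> fmt::Result {
    while count > 0 {
        // Emit at most one buffer's worth of spaces per iteration.
        let chunk = count.min(SPACES.len());
        out.write_str(&SPACES[..chunk])?;
        count -= chunk;
    }
    Ok(())
}
```

Whether this beats `" ".repeat()` in practice would need to be confirmed with the benchmarks in this PR.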
datafusion/functions/src/unicode/lpad.rs (1)

295-303: Same " ".repeat() allocation pattern as rpad — consider writing spaces directly.

Same optional optimization applies here: per-row String allocation for spaces can be avoided by writing chars directly to the builder.

♻️ Suggested optimization
                    if string.is_ascii() {
                        // ASCII fast path: byte length == character length
                        let str_len = string.len();
                        if length < str_len {
                            builder.append_value(&string[..length]);
                        } else {
-                           builder.write_str(" ".repeat(length - str_len).as_str())?;
+                           for _ in 0..length - str_len {
+                               builder.write_char(' ')?;
+                           }
                            builder.append_value(string);
                        }
datafusion/functions/benches/pad.rs (1)

120-146: create_unicode_pad_args always produces 3-argument inputs (with fill).

This means the Unicode benchmarks only exercise the 3-arg code path. Consider also benchmarking the 2-arg (space-fill) path for Unicode, as it follows a different code path with different performance characteristics. This is a minor gap — optional to address.



@claude

claude bot commented Feb 11, 2026

Pull Request Review: Optimize lpad/rpad for ASCII strings

Summary

This PR introduces ASCII fast-path optimizations for the lpad and rpad string functions, avoiding expensive grapheme segmentation when both input and fill strings are ASCII. The optimization is well-implemented and follows existing patterns in the codebase.


✅ Strengths

  1. Performance Optimization: The ASCII fast path is a smart optimization that avoids expensive Unicode grapheme segmentation when not needed. This should provide significant performance improvements for common ASCII use cases.

  2. Comprehensive Benchmarks: Excellent addition of Unicode-specific benchmarks that will help measure the performance impact of these changes and prevent regressions.

  3. Documentation Improvements: The clarification in the documentation about truncation behavior ("on the right") is helpful and improves API clarity.

  4. Test Coverage: Added test cases for truncation scenarios (lpad('hello', 2) → 'he') which were previously missing.

  5. Correct Implementation: The logic correctly handles edge cases like empty fill strings, zero/negative lengths, and maintains proper behavior for Unicode strings.

  6. Consistent Pattern: The optimization follows existing patterns used in other DataFusion string functions (character_length, substr, etc.).


🔍 Code Quality Observations

Minor: Potential Memory Allocation Optimization (lpad.rs:301, rpad.rs:249, 264)

In the ASCII fast path for the default space padding case, there are calls to " ".repeat(length - str_len):

builder.write_str(" ".repeat(length - str_len).as_str())?;

Consideration: String::repeat() allocates a new string. For large padding lengths, you might consider:

  • Writing spaces in chunks/loops to reduce temporary allocations
  • However, the current approach is cleaner and may be acceptable given that string builder internals likely optimize this

Impact: Low - this is a minor optimization opportunity, not a bug. The current code is correct and readable.


Observation: ASCII Check on Both Strings

The optimization requires both string.is_ascii() && fill.is_ascii() to be true. This is correct and necessary, but it means:

  • Mixed ASCII/Unicode scenarios still take the slow path
  • Empty fill strings are checked separately before the ASCII check

Consideration: The logic flow is:

  1. Check if fill.is_empty() → return early
  2. Check if both are ASCII → fast path
  3. Otherwise → Unicode path

This ordering is optimal since empty fill is a simpler check.
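
The three-step ordering described above can be sketched as a small dispatch function. The `PadPath` enum and `choose_pad_path` are hypothetical names introduced only for illustration; the PR itself branches inline rather than through a helper like this.

```rust
/// Which code path a (string, fill) pair would take. Hypothetical type.
#[derive(Debug, PartialEq)]
enum PadPath {
    EmptyFill,
    AsciiFast,
    Unicode,
}

/// Mirror the ordering: empty fill first, then the ASCII check,
/// otherwise the Unicode path.
fn choose_pad_path(string: &str, fill: &str) -> PadPath {
    if fill.is_empty() {
        PadPath::EmptyFill
    } else if string.is_ascii() && fill.is_ascii() {
        PadPath::AsciiFast
    } else {
        // Any non-ASCII byte in either argument forces the slow path.
        PadPath::Unicode
    }
}
```

A helper of this shape would also make the mixed ASCII/Unicode cases directly testable, since the chosen path becomes an observable value.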


🧪 Testing Observations

Strengths:

  • Truncation test cases added (lpad('hello', 2) and rpad('hello', 2))
  • Comprehensive Unicode benchmark cases added
  • Existing tests cover Unicode multi-byte characters (josé, éñ)

Minor Gap:
The tests don't explicitly verify the ASCII fast path is being used. Consider adding a test comment or benchmark that explicitly tests:

  • ASCII-only input with ASCII fill
  • Mixed ASCII input with Unicode fill (should use slow path)
  • Unicode input with ASCII fill (should use slow path)

However, the existing tests do implicitly cover these cases.


🔒 Security Review

No security concerns identified

  • Proper bounds checking with i32::MAX validation
  • Safe slice indexing in ASCII fast path (&string[..length])
  • The ASCII slicing is safe because byte length = character length for ASCII
  • No potential for integer overflow in pad length calculations
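
The slicing claim above can be illustrated with a small sketch. Byte slicing such as `&s[..n]` panics when `n` is not on a character boundary, so it is only safe once `is_ascii()` has been checked. The function `truncate_chars` is a hypothetical helper, and its non-ASCII branch counts `char`s rather than grapheme clusters.

```rust
/// Truncate `s` to at most `n` characters, using byte slicing only
/// when it is known to be safe. Hypothetical helper for illustration.
fn truncate_chars(s: &str, n: usize) -> &str {
    if s.is_ascii() {
        // For ASCII, byte index == character index.
        &s[..n.min(s.len())]
    } else {
        // For non-ASCII input, find the byte offset of the n-th char.
        match s.char_indices().nth(n) {
            Some((byte_idx, _)) => &s[..byte_idx],
            None => s,
        }
    }
}
```

For comparison, `"éñ".get(..1)` returns `None` because byte offset 1 falls inside the two-byte encoding of `é`; the indexing form `&"éñ"[..1]` would panic instead.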

📊 Performance Considerations

Expected improvements:

  • ASCII strings: Should see significant speedup (avoiding grapheme segmentation)
  • Unicode strings: No performance regression (same code path as before)

Benchmark additions are excellent:

  • Tests both ASCII (existing benchmarks) and Unicode strings (new benchmarks)
  • Tests both Utf8 and StringView types
  • Covers realistic string lengths (target=20)

📝 Documentation

Documentation is clear and accurate

  • User-facing docs updated in scalar_functions.md
  • Inline code comments explain the optimization rationale
  • Function signatures properly describe truncation behavior

🎯 Recommendations

High Priority

None - the code is production-ready as-is.

Low Priority (Nice-to-have)

  1. Consider a small comment explaining why is_ascii() enables the optimization:

    if string.is_ascii() && fill.is_ascii() {
        // ASCII fast path: byte length == character length,
        // so we skip expensive grapheme segmentation.

    Note: This comment already exists! Great documentation. ✅

  2. If profiling shows the repeat() allocations are significant, could optimize the space-padding case further.


✅ Final Assessment

Verdict: APPROVED

This is a well-implemented performance optimization that:

  • Maintains correctness for all input cases
  • Includes comprehensive benchmarks to measure impact
  • Follows existing codebase patterns
  • Has no security concerns
  • Improves test coverage

The code quality is high, the optimization is sound, and the testing is thorough. This is ready to merge.


📌 Summary Checklist

  • ✅ Code quality and best practices
  • ✅ No bugs or logical issues identified
  • ✅ Performance improvements with no regressions
  • ✅ No security concerns
  • ✅ Test coverage is adequate
  • ✅ Documentation is clear and accurate
  • ✅ Follows repository conventions

Great work! 🚀

@augmentcode

augmentcode bot commented Feb 11, 2026

🤖 Augment PR Summary

Summary: This PR optimizes lpad/rpad by adding an ASCII fast path while preserving the existing Unicode-aware (grapheme-based) behavior.

Changes:

  • Added `is_ascii()` fast paths in `unicode::lpad` and `unicode::rpad` to avoid grapheme segmentation when both input and (optional) fill strings are ASCII.
  • Kept the existing grapheme-based implementation for non-ASCII inputs/fill strings to maintain Unicode correctness.
  • Extended the pad benchmarks to include Unicode-string workloads for both `Utf8` and `Utf8View` arrays.
  • Updated user-facing docs for `lpad`/`rpad` to mention truncation when the input exceeds the requested length.
  • Added regression tests verifying truncation behavior (e.g. `"hello"` with length `2` yields `"he"`).

Technical Notes: The ASCII path relies on byte length == character length and uses byte slicing/repetition to build padding without per-row grapheme iteration.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for lpad and rpad functions by adding a fast path for ASCII strings. However, the current implementation allows for a potential Denial of Service (DoS) attack. An attacker could provide a large length parameter (up to i32::MAX), leading to excessive memory allocations and CPU usage, exacerbated by String::repeat. It is crucial to optimize the padding logic to write directly to the builder and enforce reasonable limits on the padding length to mitigate this. Otherwise, the changes are well-implemented, include benchmarks, and update documentation. A minor suggestion is to refactor the new benchmark code to reduce duplication.
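
One way to read the suggestion about enforcing limits is a validation step applied to the requested length before any padding is written. The sketch below is an assumption-heavy illustration: `MAX_PAD_LEN` is a hypothetical cap, not an existing DataFusion setting, and mapping non-positive lengths to zero is assumed rather than taken from the PR.

```rust
/// Hypothetical cap on the padded length; not a DataFusion setting.
const MAX_PAD_LEN: i64 = 1 << 20;

/// Validate a requested pad length before allocating anything.
fn validate_pad_len(n: i64) -> Result<usize, String> {
    if n <= 0 {
        // Assumption: zero or negative lengths yield an empty result.
        return Ok(0);
    }
    if n > MAX_PAD_LEN {
        return Err(format!(
            "pad length {} exceeds the limit of {}",
            n, MAX_PAD_LEN
        ));
    }
    // Safe cast: 0 < n <= MAX_PAD_LEN at this point.
    Ok(n as usize)
}
```

Whether such a cap is appropriate, and what value it should take, is a policy decision for the maintainers; the sketch only shows where the check would sit.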

Comment on lines +289 to +339
// Utf8 type with Unicode strings
let args = create_unicode_pad_args(size, 20, false);
let arg_fields = args
    .iter()
    .enumerate()
    .map(|(idx, arg)| {
        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
    })
    .collect::<Vec<_>>();

group.bench_function(
    format!("lpad utf8 unicode [size={size}, target=20]"),
    |b| {
        b.iter(|| {
            let args_cloned = args.clone();
            black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs {
                args: args_cloned,
                arg_fields: arg_fields.clone(),
                number_rows: size,
                return_field: Field::new("f", DataType::Utf8, true).into(),
                config_options: Arc::clone(&config_options),
            }))
        })
    },
);

// StringView type with Unicode strings
let args = create_unicode_pad_args(size, 20, true);
let arg_fields = args
    .iter()
    .enumerate()
    .map(|(idx, arg)| {
        Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
    })
    .collect::<Vec<_>>();

group.bench_function(
    format!("lpad stringview unicode [size={size}, target=20]"),
    |b| {
        b.iter(|| {
            let args_cloned = args.clone();
            black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs {
                args: args_cloned,
                arg_fields: arg_fields.clone(),
                number_rows: size,
                return_field: Field::new("f", DataType::Utf8View, true).into(),
                config_options: Arc::clone(&config_options),
            }))
        })
    },
);


medium

There's significant code duplication between the benchmarks for lpad utf8 unicode and lpad stringview unicode. This pattern is also present in the existing ASCII benchmarks and is repeated for the new rpad unicode benchmarks.

To improve maintainability, you could refactor this by creating a helper function. For example:

fn bench_pad_unicode(
    group: &mut criterion::BenchmarkGroup<'_, criterion::measurement::WallTime>,
    pad_fn: &dyn Fn() -> std::sync::Arc<datafusion_expr::ScalarUDF>,
    fn_name: &str,
    size: usize,
    use_string_view: bool,
    config_options: &std::sync::Arc<datafusion_common::config::ConfigOptions>,
) {
    let args = create_unicode_pad_args(size, 20, use_string_view);
    let arg_fields: Vec<std::sync::Arc<arrow::datatypes::Field>> = args
        .iter()
        .enumerate()
        .map(|(idx, arg)| {
            arrow::datatypes::Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
        })
        .collect();

    let (type_name, return_type) = if use_string_view {
        ("stringview", arrow::datatypes::DataType::Utf8View)
    } else {
        ("utf8", arrow::datatypes::DataType::Utf8)
    };

    group.bench_function(
        format!("{fn_name} {type_name} unicode [size={size}, target=20]"),
        |b| {
            b.iter(|| {
                let args_cloned = args.clone();
                std::hint::black_box(pad_fn().invoke_with_args(datafusion_expr::ScalarFunctionArgs {
                    args: args_cloned,
                    arg_fields: arg_fields.clone(),
                    number_rows: size,
                    return_field: arrow::datatypes::Field::new("f", return_type.clone(), true).into(),
                    config_options: std::sync::Arc::clone(config_options),
                }));
            })
        },
    );
}

You could then invoke it for lpad and rpad like this:

// In lpad group
bench_pad_unicode(group, &unicode::lpad, "lpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::lpad, "lpad", size, true, &config_options);

// In rpad group
bench_pad_unicode(group, &unicode::rpad, "rpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::rpad, "rpad", size, true, &config_options);

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The benchmark contains duplicated code that would be good to simplify by extracting it into a reusable helper function.


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


argument(name = "n", description = "String length to pad to."),
argument(
name = "n",
description = "String length to pad to. If the input string is longer than this length, it is truncated."

The n argument docs mention truncation but don’t say where it truncates; elsewhere (and for lpad) it’s described as truncating “on the right”. Consider clarifying the direction here as well to avoid ambiguity for users.

Severity: low

Other Locations
  • docs/source/user-guide/sql/scalar_functions.md:1823


Owner Author


value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The documentation for rpad does not mention the direction of the truncation, in contrast to lpad. It would be good to synchronize the documentation of both functions and make it as specific as possible.
