20278: perf: Optimize lpad, rpad for ASCII strings by martin-augment · Pull Request #239 · martin-augment/datafusion

martin-augment · 2026-02-11T12:48:24Z

20278: To review by AI

The previous implementation incurred the overhead of Unicode machinery, even for the common case where both the input string and the fill string consist only of ASCII characters. For the ASCII-only case, we can assume that the length in bytes equals the length in characters, and avoid expensive grapheme-based segmentation. This follows similar optimizations applied elsewhere in the codebase. Benchmarks indicate this is a significant performance win for ASCII-only input (4x-10x faster) at the cost of a mild regression for Unicode input (2-5% slower).

Along the way:

* Replace a few instances of `write_str(str)? + append_value("")` with `append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
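The ASCII fast path described above can be sketched with the standard library alone. This is a minimal illustration rather than the PR's code: `lpad_spaces` is a hypothetical helper, and its non-ASCII branch counts `char`s as a stand-in for the grapheme segmentation the real implementation uses.

```rust
/// Left-pad `s` with spaces to `n` characters, truncating on the right
/// if `s` is already longer. Hypothetical helper, for illustration only.
fn lpad_spaces(s: &str, n: usize) -> String {
    if s.is_ascii() {
        // ASCII fast path: every character is one byte,
        // so byte length == character length and byte slicing is safe.
        let len = s.len();
        if n < len {
            return s[..n].to_string();
        }
        let mut out = String::with_capacity(n);
        out.extend(std::iter::repeat(' ').take(n - len));
        out.push_str(s);
        out
    } else {
        // Slow path: count characters one by one
        // (assumption: the real code segments graphemes here instead).
        let len = s.chars().count();
        if n < len {
            return s.chars().take(n).collect();
        }
        let mut out = String::new();
        out.extend(std::iter::repeat(' ').take(n - len));
        out.push_str(s);
        out
    }
}
```

With this sketch, `lpad_spaces("hello", 2)` takes the fast path and truncates to `"he"`, while `lpad_spaces("josé", 5)` falls back to the character-counting branch.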

gemini-code-assist · 2026-02-11T12:48:43Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces performance enhancements for the lpad and rpad string functions by adding an optimized code path for ASCII strings. This change bypasses the more computationally intensive Unicode grapheme segmentation when not needed, leading to faster execution for common ASCII data. Alongside this optimization, the PR expands the benchmarking suite to include Unicode string scenarios and clarifies the truncation behavior in the official documentation for these functions.

Highlights

Performance Optimization for ASCII Strings: Implemented a fast path for lpad and rpad functions when processing ASCII strings, significantly improving performance by avoiding expensive grapheme segmentation.
Enhanced Benchmarking: Added new benchmarking utilities and test cases specifically for Unicode strings to thoroughly evaluate the performance of lpad and rpad under various conditions.
Documentation Clarification: Updated the documentation for lpad and rpad to explicitly state that input strings longer than the target length n will be truncated.
New Test Cases: Introduced new test cases for both lpad and rpad to cover the truncation behavior when the input string exceeds the specified length.

Changelog

datafusion/functions/benches/pad.rs
- Updated benchmarking utilities and added new benchmarks for lpad and rpad with Unicode strings.
- Introduced new utility functions to create Unicode string arrays for benchmarking.
- Added new benchmark cases for lpad and rpad functions using Unicode strings with both Utf8 and StringView types.
datafusion/functions/src/unicode/lpad.rs
- Implemented an optimized ASCII fast path for the lpad function, bypassing expensive grapheme segmentation for ASCII inputs.
- Clarified the documentation for the n argument, explicitly stating that strings longer than n will be truncated.
- Added a new test case to verify the truncation behavior of lpad.
datafusion/functions/src/unicode/rpad.rs
- Implemented an optimized ASCII fast path for the rpad function, bypassing expensive grapheme segmentation for ASCII inputs.
- Clarified the documentation for the n argument and the function description, explicitly stating that strings longer than n will be truncated.
- Added a new test case to verify the truncation behavior of rpad.
docs/source/user-guide/sql/scalar_functions.md
- Updated the lpad function documentation to clarify that input strings longer than the specified length n will be truncated.
- Updated the rpad function documentation to clarify that input strings longer than the specified length n will be truncated.

Activity

The pull request description indicates that it is 'To review by AI', suggesting no prior human review activity or comments have been recorded.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-02-11T12:48:51Z

Walkthrough

This pull request adds performance optimizations to the lpad and rpad string padding functions by introducing ASCII fast paths that bypass grapheme segmentation when both input and fill strings are ASCII. The implementation retains existing grapheme-based logic for non-ASCII inputs. Documentation for the n parameter in both functions and their user guide entries is updated to clarify truncation behavior on the right when input strings exceed the specified length. New benchmarks with Unicode-aware test data and builders are added to measure performance across various Unicode scenarios for both Utf8 and Utf8View string storage formats.

No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments

datafusion/functions/src/unicode/rpad.rs (1)
241-252: Minor allocation in the ASCII fast path: " ".repeat() creates a heap String per row.

When padding is needed, " ".repeat(length - str_len) allocates a temporary String on each iteration. Consider writing spaces directly to the builder in a loop or using a pre-allocated buffer to avoid per-row heap allocations, which partially negates the fast-path benefit.
♻️ Suggested optimization
                            } else {
                                builder.write_str(string)?;
-                               builder.write_str(
-                                   " ".repeat(length - str_len).as_str(),
-                               )?;
+                               for _ in 0..length - str_len {
+                                   builder.write_char(' ')?;
+                               }
                                builder.append_value("");
                            }
datafusion/functions/src/unicode/lpad.rs (1)
295-303: Same " ".repeat() allocation pattern as rpad — consider writing spaces directly.

Same optional optimization applies here: per-row String allocation for spaces can be avoided by writing chars directly to the builder.
♻️ Suggested optimization
                    if string.is_ascii() {
                        // ASCII fast path: byte length == character length
                        let str_len = string.len();
                        if length < str_len {
                            builder.append_value(&string[..length]);
                        } else {
-                           builder.write_str(" ".repeat(length - str_len).as_str())?;
+                           for _ in 0..length - str_len {
+                               builder.write_char(' ')?;
+                           }
                            builder.append_value(string);
                        }
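The trade-off behind this suggestion can be sketched outside DataFusion. The example below uses a plain `String` as a stand-in for the string builder (an assumption; the snippets above only show that the builder accepts `write_str` and `write_char`), and both function names are hypothetical. It assumes ASCII input, so `s.len()` is the character count.

```rust
use std::fmt::Write;

/// Pads by building a temporary `String` first: one extra heap
/// allocation per call whenever padding is needed.
fn pad_with_repeat(out: &mut String, s: &str, width: usize) -> std::fmt::Result {
    let pad = width.saturating_sub(s.len());
    out.write_str(" ".repeat(pad).as_str())?;
    out.write_str(s)
}

/// Pads by writing spaces straight into the output,
/// with no temporary allocation.
fn pad_direct(out: &mut String, s: &str, width: usize) -> std::fmt::Result {
    for _ in 0..width.saturating_sub(s.len()) {
        out.write_char(' ')?;
    }
    out.write_str(s)
}
```

Both functions produce the same output; whether the direct version is measurably faster in the real builder would need to be confirmed with the pad benchmarks.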
datafusion/functions/benches/pad.rs (1)

120-146: create_unicode_pad_args always produces 3-argument inputs (with fill).

This means the Unicode benchmarks only exercise the 3-arg code path. Consider also benchmarking the 2-arg (space-fill) path for Unicode, as it follows a different code path with different performance characteristics. This is a minor gap — optional to address.


claude · 2026-02-11T12:50:03Z

Pull Request Review: Optimize lpad/rpad for ASCII strings

Summary

This PR introduces ASCII fast-path optimizations for the lpad and rpad string functions, avoiding expensive grapheme segmentation when both input and fill strings are ASCII. The optimization is well-implemented and follows existing patterns in the codebase.

✅ Strengths

Performance Optimization: The ASCII fast path is a smart optimization that avoids expensive Unicode grapheme segmentation when not needed. This should provide significant performance improvements for common ASCII use cases.
Comprehensive Benchmarks: Excellent addition of Unicode-specific benchmarks that will help measure the performance impact of these changes and prevent regressions.
Documentation Improvements: The clarification in the documentation about truncation behavior ("on the right") is helpful and improves API clarity.
Test Coverage: Added test cases for truncation scenarios (lpad('hello', 2) → 'he') which were previously missing.
Correct Implementation: The logic correctly handles edge cases like empty fill strings, zero/negative lengths, and maintains proper behavior for Unicode strings.
Consistent Pattern: The optimization follows existing patterns used in other DataFusion string functions (character_length, substr, etc.).

🔍 Code Quality Observations

Minor: Potential Memory Allocation Optimization (lpad.rs:301, rpad.rs:249, 264)

In the ASCII fast path for the default space padding case, there are calls to " ".repeat(length - str_len):

builder.write_str(" ".repeat(length - str_len).as_str())?;

Consideration: String::repeat() allocates a new string. For large padding lengths, you might consider:

Writing spaces in chunks/loops to reduce temporary allocations
However, the current approach is cleaner and may be acceptable given that string builder internals likely optimize this

Impact: Low - this is a minor optimization opportunity, not a bug. The current code is correct and readable.

Observation: ASCII Check on Both Strings

The optimization requires both string.is_ascii() && fill.is_ascii() to be true. This is correct and necessary, but it means:

Mixed ASCII/Unicode scenarios still take the slow path
Empty fill strings are checked separately before the ASCII check

Consideration: The logic flow is:

Check if fill.is_empty() → return early
Check if both are ASCII → fast path
Otherwise → Unicode path

This ordering is optimal since empty fill is a simpler check.
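The three-step flow above can be sketched as follows. `lpad_fill` is a hypothetical function written only with the standard library; the slow path iterates `char`s where the real code uses graphemes, and truncating an over-long input when the fill is empty is an assumption, since that case is not shown in this thread.

```rust
/// Left-pad `s` to `n` characters by cycling through `fill`.
/// Hypothetical sketch of the dispatch order, not the PR's code.
fn lpad_fill(s: &str, n: usize, fill: &str) -> String {
    // 1. Empty fill: nothing can be added, so return the input
    //    (truncated if it is already too long; assumed behavior).
    if fill.is_empty() {
        return s.chars().take(n).collect();
    }
    // 2. Both ASCII: byte offsets are character offsets.
    if s.is_ascii() && fill.is_ascii() {
        if n < s.len() {
            return s[..n].to_string();
        }
        let pad = n - s.len();
        let mut out = String::with_capacity(n);
        out.extend(fill.bytes().cycle().take(pad).map(char::from));
        out.push_str(s);
        return out;
    }
    // 3. Otherwise: character-aware path.
    let len = s.chars().count();
    if n < len {
        return s.chars().take(n).collect();
    }
    let mut out: String = fill.chars().cycle().take(n - len).collect();
    out.push_str(s);
    out
}
```

Note that a mixed call such as `lpad_fill("hi", 4, "é")` skips step 2 because the fill is not ASCII, which matches the observation that mixed ASCII/Unicode inputs still take the slow path.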

🧪 Testing Observations

Strengths:

Truncation test cases added (lpad('hello', 2) and rpad('hello', 2))
Comprehensive Unicode benchmark cases added
Existing tests cover Unicode multi-byte characters (josé, éñ)

Minor Gap:
The tests don't explicitly verify the ASCII fast path is being used. Consider adding a test comment or benchmark that explicitly tests:

ASCII-only input with ASCII fill
Mixed ASCII input with Unicode fill (should use slow path)
Unicode input with ASCII fill (should use slow path)

However, the existing tests do implicitly cover these cases.

🔒 Security Review

✅ No security concerns identified

Proper bounds checking with i32::MAX validation
Safe slice indexing in ASCII fast path (&string[..length])
The ASCII slicing is safe because byte length = character length for ASCII
No potential for integer overflow in pad length calculations
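The slicing claim rests on every byte offset of an ASCII string being a character boundary. A small sketch of what the `is_ascii()` guard protects against, using the non-panicking `str::get`; `truncate_bytes` is a hypothetical helper.

```rust
/// Returns the first `n` bytes of `s` if that cut lands on a character
/// boundary and is within bounds. `str::get` returns `None` where the
/// indexing form `&s[..n]` would panic.
fn truncate_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}
```

For `"hello"` any `n <= 5` succeeds, whereas cutting `"éa"` at byte 1 lands inside the two-byte `é` and yields `None`; the indexing form `&s[..1]` would panic in that case.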

📊 Performance Considerations

Expected improvements:

ASCII strings: Should see significant speedup (avoiding grapheme segmentation)
Unicode strings: Same grapheme-based code path as before, so at most a small regression is expected (the PR description reports 2-5% slower)

Benchmark additions are excellent:

Tests both ASCII (existing benchmarks) and Unicode strings (new benchmarks)
Tests both Utf8 and StringView types
Covers realistic string lengths (target=20)

📝 Documentation

✅ Documentation is clear and accurate

User-facing docs updated in scalar_functions.md
Inline code comments explain the optimization rationale
Function signatures properly describe truncation behavior

🎯 Recommendations

High Priority

None - the code is production-ready as-is.

Low Priority (Nice-to-have)

Consider a small comment explaining why is_ascii() enables the optimization:

if string.is_ascii() && fill.is_ascii() {
    // ASCII fast path: byte length == character length,
    // so we skip expensive grapheme segmentation.

Note: This comment already exists! Great documentation. ✅

If profiling shows the repeat() allocations are significant, could optimize the space-padding case further.

✅ Final Assessment

Verdict: APPROVED ✨

This is a well-implemented performance optimization that:

Maintains correctness for all input cases
Includes comprehensive benchmarks to measure impact
Follows existing codebase patterns
Has no security concerns
Improves test coverage

The code quality is high, the optimization is sound, and the testing is thorough. This is ready to merge.

📌 Summary Checklist

✅ Code quality and best practices
✅ No bugs or logical issues identified
✅ Performance improvements for ASCII input, with only a mild regression reported for Unicode input
✅ No security concerns
✅ Test coverage is adequate
✅ Documentation is clear and accurate
✅ Follows repository conventions

Great work! 🚀

augmentcode · 2026-02-11T12:52:04Z

🤖 Augment PR Summary

Summary: This PR optimizes lpad/rpad by adding an ASCII fast path while preserving the existing Unicode-aware (grapheme-based) behavior.

Changes:

Added `is_ascii()` fast paths in `unicode::lpad` and `unicode::rpad` to avoid grapheme segmentation when both input and (optional) fill strings are ASCII.
Kept the existing grapheme-based implementation for non-ASCII inputs/fill strings to maintain Unicode correctness.
Extended the pad benchmarks to include Unicode-string workloads for both `Utf8` and `Utf8View` arrays.
Updated user-facing docs for `lpad`/`rpad` to mention truncation when the input exceeds the requested length.
Added regression tests verifying truncation behavior (e.g. `"hello"` with length `2` yields `"he"`).

Technical Notes: The ASCII path relies on byte length == character length and uses byte slicing/repetition to build padding without per-row grapheme iteration.


gemini-code-assist

Code Review

This pull request introduces a performance optimization for lpad and rpad functions by adding a fast path for ASCII strings. However, the current implementation allows for a potential Denial of Service (DoS) attack. An attacker could provide a large length parameter (up to i32::MAX), leading to excessive memory allocations and CPU usage, exacerbated by String::repeat. It is crucial to optimize the padding logic to write directly to the builder and enforce reasonable limits on the padding length to mitigate this. Otherwise, the changes are well-implemented, include benchmarks, and update documentation. A minor suggestion is to refactor the new benchmark code to reduce duplication.
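One way to act on the recommendation to limit the padding length is to validate the target length before any output is allocated. The sketch below is illustrative only: `MAX_PAD_LEN`, its value, and `checked_pad_len` are hypothetical, and DataFusion may already bound this elsewhere.

```rust
/// Hypothetical upper bound on the padded length; the value is illustrative.
const MAX_PAD_LEN: usize = 1 << 20;

/// Validates a user-supplied target length before any padding is written.
fn checked_pad_len(n: i64) -> Result<usize, String> {
    if n <= 0 {
        // Zero or negative targets are treated here as producing an
        // empty result, so there is nothing to allocate.
        return Ok(0);
    }
    // `n` is positive, so the cast to u64 is lossless.
    if n as u64 > MAX_PAD_LEN as u64 {
        return Err(format!("pad length {n} exceeds limit {MAX_PAD_LEN}"));
    }
    Ok(n as usize)
}
```

A caller would run this check once per row (or once for a scalar length argument) and surface the error instead of attempting a multi-gigabyte allocation.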

gemini-code-assist · 2026-02-11T12:52:05Z

datafusion/functions/benches/pad.rs

+        // Utf8 type with Unicode strings
+        let args = create_unicode_pad_args(size, 20, false);
+        let arg_fields = args
+            .iter()
+            .enumerate()
+            .map(|(idx, arg)| {
+                Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
+            })
+            .collect::<Vec<_>>();
+
+        group.bench_function(
+            format!("lpad utf8 unicode [size={size}, target=20]"),
+            |b| {
+                b.iter(|| {
+                    let args_cloned = args.clone();
+                    black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs {
+                        args: args_cloned,
+                        arg_fields: arg_fields.clone(),
+                        number_rows: size,
+                        return_field: Field::new("f", DataType::Utf8, true).into(),
+                        config_options: Arc::clone(&config_options),
+                    }))
+                })
+            },
+        );
+
+        // StringView type with Unicode strings
+        let args = create_unicode_pad_args(size, 20, true);
+        let arg_fields = args
+            .iter()
+            .enumerate()
+            .map(|(idx, arg)| {
+                Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
+            })
+            .collect::<Vec<_>>();
+
+        group.bench_function(
+            format!("lpad stringview unicode [size={size}, target=20]"),
+            |b| {
+                b.iter(|| {
+                    let args_cloned = args.clone();
+                    black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs {
+                        args: args_cloned,
+                        arg_fields: arg_fields.clone(),
+                        number_rows: size,
+                        return_field: Field::new("f", DataType::Utf8View, true).into(),
+                        config_options: Arc::clone(&config_options),
+                    }))
+                })
+            },
+        );


There's significant code duplication between the benchmarks for lpad utf8 unicode and lpad stringview unicode. This pattern is also present in the existing ASCII benchmarks and is repeated for the new rpad unicode benchmarks.

To improve maintainability, you could refactor this by creating a helper function. For example:

fn bench_pad_unicode(
    // BenchmarkGroup needs its lifetime and measurement parameters spelled out.
    group: &mut criterion::BenchmarkGroup<'_, criterion::measurement::WallTime>,
    pad_fn: &dyn Fn() -> std::sync::Arc<datafusion_expr::ScalarUDF>,
    fn_name: &str,
    size: usize,
    use_string_view: bool,
    config_options: &std::sync::Arc<datafusion_common::config::ConfigOptions>,
) {
    let args = create_unicode_pad_args(size, 20, use_string_view);
    let arg_fields: Vec<std::sync::Arc<arrow::datatypes::Field>> = args
        .iter()
        .enumerate()
        .map(|(idx, arg)| {
            arrow::datatypes::Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
        })
        .collect();
    let (type_name, return_type) = if use_string_view {
        ("stringview", arrow::datatypes::DataType::Utf8View)
    } else {
        ("utf8", arrow::datatypes::DataType::Utf8)
    };
    group.bench_function(
        format!("{fn_name} {type_name} unicode [size={size}, target=20]"),
        |b| {
            b.iter(|| {
                let args_cloned = args.clone();
                std::hint::black_box(pad_fn().invoke_with_args(
                    datafusion_expr::ScalarFunctionArgs {
                        args: args_cloned,
                        arg_fields: arg_fields.clone(),
                        number_rows: size,
                        return_field: arrow::datatypes::Field::new("f", return_type.clone(), true)
                            .into(),
                        config_options: std::sync::Arc::clone(config_options),
                    },
                ));
            })
        },
    );
}

You could then invoke it for lpad and rpad like this:

// In lpad group
bench_pad_unicode(group, &unicode::lpad, "lpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::lpad, "lpad", size, true, &config_options);
// In rpad group
bench_pad_unicode(group, &unicode::rpad, "rpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::rpad, "rpad", size, true, &config_options);

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The benchmark code is duplicated and could be simplified by extracting it into a helper function and reusing it.

augmentcode

Review completed. 1 suggestion posted.


augmentcode · 2026-02-11T12:52:05Z

datafusion/functions/src/unicode/rpad.rs

-    argument(name = "n", description = "String length to pad to."),
+    argument(
+        name = "n",
+        description = "String length to pad to. If the input string is longer than this length, it is truncated."


The n argument docs mention truncation but don’t say where it truncates; elsewhere (and for lpad) it’s described as truncating “on the right”. Consider clarifying the direction here as well to avoid ambiguity for users.

Severity: low

Other Locations

docs/source/user-guide/sql/scalar_functions.md:1823


value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The documentation for rpad does not mention the direction of the truncation, in contrast to lpad. It would be good to synchronize the two and make them as specific as possible.

martin-augment wants to merge 1 commit into main from pr-20278-2026-02-11-12-48-20
