20278: perf: Optimize lpad, rpad for ASCII strings#239
20278: perf: Optimize lpad, rpad for ASCII strings#239martin-augment wants to merge 1 commit intomainfrom
Conversation
The previous implementation incurred the overhead of Unicode machinery,
even for the common case that both the input string and the fill string
consistent only of ASCII characters. For the ASCII-only case, we can
assume that the length in bytes equals the length in characters, and
avoid expensive graphene-based segmentation. This follows similar
optimizations applied elsewhere in the codebase.
Benchmarks indicate this is a significant performance win for ASCII-only
input (4x-10x faster) but only a mild regression for Unicode input (2-5%
slower).
Along the way:
* Combine: a few instances of `write_str(str)? + append_value("")` with
`append_value(str)`, which saves a few cycles
* Add a missing test case for truncating the input string
* Add benchmarks for Unicode input
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces performance enhancements for the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis pull request adds performance optimizations to the lpad and rpad string padding functions by introducing ASCII fast paths that bypass grapheme segmentation when both input and fill strings are ASCII. The implementation retains existing grapheme-based logic for non-ASCII inputs. Documentation for the ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Pull Request Review: Optimize lpad/rpad for ASCII stringsSummaryThis PR introduces ASCII fast-path optimizations for the ✅ Strengths
🔍 Code Quality ObservationsMinor: Potential Memory Allocation Optimization (lpad.rs:301, rpad.rs:249, 264)In the ASCII fast path for the default space padding case, there are calls to builder.write_str(" ".repeat(length - str_len).as_str())?;Consideration:
Impact: Low - this is a minor optimization opportunity, not a bug. The current code is correct and readable. Observation: ASCII Check on Both StringsThe optimization requires both
Consideration: The logic flow is:
This ordering is optimal since empty fill is a simpler check. 🧪 Testing ObservationsStrengths:
Minor Gap:
However, the existing tests do implicitly cover these cases. 🔒 Security Review✅ No security concerns identified
📊 Performance ConsiderationsExpected improvements:
Benchmark additions are excellent:
📝 Documentation✅ Documentation is clear and accurate
🎯 RecommendationsHigh PriorityNone - the code is production-ready as-is. Low Priority (Nice-to-have)
✅ Final AssessmentVerdict: APPROVED ✨ This is a well-implemented performance optimization that:
The code quality is high, the optimization is sound, and the testing is thorough. This is ready to merge. 📌 Summary Checklist
Great work! 🚀 |
🤖 Augment PR SummarySummary: This PR optimizes Changes:
Technical Notes: The ASCII path relies on byte length == character length and uses byte slicing/repetition to build padding without per-row grapheme iteration. 🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Code Review
This pull request introduces a performance optimization for lpad and rpad functions by adding a fast path for ASCII strings. However, the current implementation allows for a potential Denial of Service (DoS) attack. An attacker could provide a large length parameter (up to i32::MAX), leading to excessive memory allocations and CPU usage, exacerbated by String::repeat. It is crucial to optimize the padding logic to write directly to the builder and enforce reasonable limits on the padding length to mitigate this. Otherwise, the changes are well-implemented, include benchmarks, and update documentation. A minor suggestion is to refactor the new benchmark code to reduce duplication.
| // Utf8 type with Unicode strings | ||
| let args = create_unicode_pad_args(size, 20, false); | ||
| let arg_fields = args | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(idx, arg)| { | ||
| Field::new(format!("arg_{idx}"), arg.data_type(), true).into() | ||
| }) | ||
| .collect::<Vec<_>>(); | ||
|
|
||
| group.bench_function( | ||
| format!("lpad utf8 unicode [size={size}, target=20]"), | ||
| |b| { | ||
| b.iter(|| { | ||
| let args_cloned = args.clone(); | ||
| black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs { | ||
| args: args_cloned, | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: size, | ||
| return_field: Field::new("f", DataType::Utf8, true).into(), | ||
| config_options: Arc::clone(&config_options), | ||
| })) | ||
| }) | ||
| }, | ||
| ); | ||
|
|
||
| // StringView type with Unicode strings | ||
| let args = create_unicode_pad_args(size, 20, true); | ||
| let arg_fields = args | ||
| .iter() | ||
| .enumerate() | ||
| .map(|(idx, arg)| { | ||
| Field::new(format!("arg_{idx}"), arg.data_type(), true).into() | ||
| }) | ||
| .collect::<Vec<_>>(); | ||
|
|
||
| group.bench_function( | ||
| format!("lpad stringview unicode [size={size}, target=20]"), | ||
| |b| { | ||
| b.iter(|| { | ||
| let args_cloned = args.clone(); | ||
| black_box(unicode::lpad().invoke_with_args(ScalarFunctionArgs { | ||
| args: args_cloned, | ||
| arg_fields: arg_fields.clone(), | ||
| number_rows: size, | ||
| return_field: Field::new("f", DataType::Utf8View, true).into(), | ||
| config_options: Arc::clone(&config_options), | ||
| })) | ||
| }) | ||
| }, | ||
| ); |
There was a problem hiding this comment.
There's significant code duplication between the benchmarks for lpad utf8 unicode and lpad stringview unicode. This pattern is also present in the existing ASCII benchmarks and is repeated for the new rpad unicode benchmarks.
To improve maintainability, you could refactor this by creating a helper function. For example:
fn bench_pad_unicode(
group: &mut criterion::BenchmarkGroup,
pad_fn: &dyn Fn() -> std::sync::Arc<datafusion_expr::ScalarUDF>,
fn_name: &str,
size: usize,
use_string_view: bool,
config_options: &std::sync::Arc<datafusion_common::config::ConfigOptions>,
) {
let args = create_unicode_pad_args(size, 20, use_string_view);
let arg_fields: Vec<std::sync::Arc<arrow::datatypes::Field>> = args
.iter()
.enumerate()
.map(|(idx, arg)| {
arrow::datatypes::Field::new(format!("arg_{idx}"), arg.data_type(), true).into()
})
.collect();
let (type_name, return_type) = if use_string_view {
("stringview", arrow::datatypes::DataType::Utf8View)
} else {
("utf8", arrow::datatypes::DataType::Utf8)
};
group.bench_function(
format!("{fn_name} {type_name} unicode [size={size}, target=20]"),
|b| {
b.iter(|| {
let args_cloned = args.clone();
std::hint::black_box(pad_fn().invoke_with_args(datafusion_expr::ScalarFunctionArgs {
args: args_cloned,
arg_fields: arg_fields.clone(),
number_rows: size,
return_field: arrow::datatypes::Field::new("f", return_type.clone(), true).into(),
config_options: std::sync::Arc::clone(config_options),
}));
})
},
);
}You could then invoke it for lpad and rpad like this:
// In lpad group
bench_pad_unicode(group, &unicode::lpad, "lpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::lpad, "lpad", size, true, &config_options);
// In rpad group
bench_pad_unicode(group, &unicode::rpad, "rpad", size, false, &config_options);
bench_pad_unicode(group, &unicode::rpad, "rpad", size, true, &config_options);There was a problem hiding this comment.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! There is duplicated code in the benchmark test that would be good to be simplified by extracting it into a helper function and reusing it.
| argument(name = "n", description = "String length to pad to."), | ||
| argument( | ||
| name = "n", | ||
| description = "String length to pad to. If the input string is longer than this length, it is truncated." |
There was a problem hiding this comment.
The n argument docs mention truncation but don’t say where it truncates; elsewhere (and for lpad) it’s described as truncating “on the right”. Consider clarifying the direction here as well to avoid ambiguity for users.
Severity: low
Other Locations
docs/source/user-guide/sql/scalar_functions.md:1823
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! The documentation for rpad does not mention the direction of the padding, in contrast to lpad. It would be good to synchronize the documentations and make them as specific as possible
20278: To review by AI