perf: Optimize translate() UDF for scalar inputs #245

martin-augment wants to merge 5 commits into main
Conversation
AFAIK there is no reason not to support this.
When the second and third arguments are constants (which is common), we can build the lookup table once, rather than rebuilding it for every input row. When all of the arguments are ASCII-only, we can do lookups via a fixed-size lookup table that directly maps ASCII byte values, rather than a hash table.
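The fixed-size lookup-table idea described above can be sketched roughly as follows. This is a dependency-free illustration, not the PR's actual code: the names `build_ascii_table`/`translate_ascii` and the `0xFF` delete sentinel are hypothetical, chosen here because `0xFF` can never appear in ASCII output.

```rust
const ASCII_DELETE: u8 = 0xFF; // sentinel: byte should be dropped, not mapped

/// Build a 256-entry byte table once for constant `from`/`to` arguments.
fn build_ascii_table(from: &str, to: &str) -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8; // identity mapping by default
    }
    let to_bytes = to.as_bytes();
    // Iterate in reverse so the first occurrence of a duplicated `from`
    // byte wins, matching PostgreSQL translate() semantics.
    for (i, &b) in from.as_bytes().iter().enumerate().rev() {
        table[b as usize] = *to_bytes.get(i).unwrap_or(&ASCII_DELETE);
    }
    table
}

/// Per-row translation is then a single table lookup per byte.
fn translate_ascii(s: &str, table: &[u8; 256]) -> String {
    let mut out = Vec::with_capacity(s.len());
    for &b in s.as_bytes() {
        let mapped = table[b as usize];
        if mapped != ASCII_DELETE {
            out.push(mapped);
        }
    }
    String::from_utf8(out).unwrap() // table emits ASCII only; cannot fail
}

fn main() {
    // '1'→'a', '4'→'x'; '3' has no counterpart in `to`, so it is deleted.
    let table = build_ascii_table("143", "ax");
    assert_eq!(translate_ascii("12345", &table), "a2x5");
    println!("{}", translate_ascii("12345", &table));
}
```

The example mirrors the classic PostgreSQL `translate('12345', '143', 'ax') = 'a2x5'` behavior: characters in `from` with no counterpart in `to` are removed.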
Summary of Changes (Gemini Code Assist): This pull request significantly enhances the performance of the `translate()` UDF.
Walkthrough (CodeRabbit): The changes refactor and enhance the `translate` function implementation across benchmarks, core function logic, and documentation. The benchmark suite expands to test additional string lengths and separates array-based from scalar-based translation paths. The core function adds support for `LargeUtf8` input types, implements an optimized fast path for constant ASCII inputs using precomputed lookup tables, and provides a grapheme-based fallback for non-ASCII content. Parameter names are standardized from "chars"/"translation" to "from"/"to" across the codebase. Documentation and test coverage are updated accordingly.
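The constant-argument slow path (the non-ASCII fallback) can be sketched like this. The real implementation operates on grapheme clusters via the unicode-segmentation crate; this dependency-free sketch uses `char`s instead, and the helper names are illustrative only.

```rust
use std::collections::HashMap;

/// Built once for constant `from`: maps each character to its first
/// position in `from` (first occurrence wins, as in PostgreSQL translate()).
fn build_from_map(from: &str) -> HashMap<char, usize> {
    let mut map = HashMap::new();
    for (i, c) in from.chars().enumerate() {
        map.entry(c).or_insert(i);
    }
    map
}

/// Applied per row: map, delete, or pass through each character.
fn translate_mapped(s: &str, from_map: &HashMap<char, usize>, to_chars: &[char]) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match from_map.get(&c) {
            Some(&n) => {
                if let Some(&replacement) = to_chars.get(n) {
                    out.push(replacement); // mapped
                } // else: no counterpart in `to` → character is deleted
            }
            None => out.push(c), // untouched
        }
    }
    out
}

fn main() {
    // 'é'→'e'; 'a' has no counterpart in `to`, so it is deleted.
    let from_map = build_from_map("éa");
    let to_chars: Vec<char> = "e".chars().collect();
    assert_eq!(translate_mapped("éclair", &from_map, &to_chars), "eclir");
    println!("{}", translate_mapped("éclair", &from_map, &to_chars));
}
```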
Code Review: Optimize `translate()` UDF for scalar inputs
| Severity | Issue |
|---|---|
| Bug | Unnecessary ascii_buf.clone() doubles allocation in the hot path (#2) |
| Bug | Typo in doc comment "rows" → "uses" (#3) |
| Design | Null scalar vs. array ambiguity in try_as_scalar_str (#1) |
| Design | LargeUtf8 signature/array-path consistency (#4, #5) |
| Perf | string_graphemes Vec is unnecessary in slow path (#6) |
| Tests | Missing coverage for LargeUtf8, array args, and ASCII fast-path (#7–#9) |
The most important fix before merging is #2 (the clone in the fast path undermines the optimization). The rest are improvements worth addressing.
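Issue #2 can be fixed without any clone at all: if the per-row ASCII buffer is reused across rows and only borrowed when a row is emitted, there is no per-row `String` allocation to begin with. A rough stdlib-only sketch (the actual code appends into an Arrow string builder; `translate_rows`, `uppercase_table`, and the `0xFF` delete sentinel are illustrative names, not the PR's identifiers):

```rust
// Reuse one byte buffer across rows and hand the sink a borrowed &str,
// instead of cloning the buffer into a fresh String for every row.
fn translate_rows(rows: &[&str], table: &[u8; 256], mut emit: impl FnMut(&str)) {
    let mut ascii_buf: Vec<u8> = Vec::new(); // allocated once, reused per row
    for s in rows {
        ascii_buf.clear();
        for &b in s.as_bytes() {
            let mapped = table[b as usize];
            if mapped != 0xFF { // 0xFF marks "delete this byte"
                ascii_buf.push(mapped);
            }
        }
        // Borrow, don't clone: from_utf8 on &[u8] validates without copying.
        emit(std::str::from_utf8(&ascii_buf).expect("table emits ASCII only"));
    }
}

/// Example table: identity, except ASCII lowercase maps to uppercase.
fn uppercase_table() -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8;
    }
    for b in b'a'..=b'z' {
        table[b as usize] = b - 32;
    }
    table
}

fn main() {
    let mut out = Vec::new();
    translate_rows(&["abc", "Rust"], &uppercase_table(), |s| out.push(s.to_string()));
    assert_eq!(out, ["ABC", "RUST"]);
    println!("{:?}", out);
}
```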
Code Review
This pull request introduces a significant performance optimization for the `translate()` UDF, specifically for cases where the `from` and `to` arguments are scalar values. It achieves this by pre-computing the translation map and adding a fast path for ASCII-only translations. The changes are well-structured, including new benchmarks to validate the performance gains and updated documentation.
I've found one area for a minor improvement in the new implementation to make it even more efficient by removing an intermediate allocation. Overall, this is a high-quality contribution.
```rust
string_graphemes.clear();
result_graphemes.clear();

string_graphemes.extend(s.graphemes(true));
for c in &string_graphemes {
    match from_map.get(*c) {
        Some(n) => {
            if let Some(replacement) = to_graphemes.get(*n) {
                result_graphemes.push(*replacement);
            }
        }
        None => result_graphemes.push(*c),
    }
}
```
The `string_graphemes` vector seems unnecessary. You can iterate directly over the graphemes of the input string `s` to build `result_graphemes`. This avoids populating an intermediate vector for each row, which should be more efficient. You would also need to remove the declaration of `string_graphemes` at line 316.
```rust
result_graphemes.clear();
for c in s.graphemes(true) {
    match from_map.get(c) {
        Some(n) => {
            if let Some(replacement) = to_graphemes.get(*n) {
                result_graphemes.push(*replacement);
            }
        }
        None => result_graphemes.push(c),
    }
}
```
🤖 Augment PR Summary

Summary: This PR optimizes the `translate()` UDF for scalar `from`/`to` arguments.

Technical Notes: The optimized path reuses the existing grapheme-based behavior for non-ASCII inputs, while accelerating ASCII-only cases via direct byte substitution.
```rust
vec![
    Exact(vec![Utf8View, Utf8, Utf8]),
    Exact(vec![Utf8, Utf8, Utf8]),
    Exact(vec![LargeUtf8, Utf8, Utf8]),
```
The new signature arm `Exact(vec![LargeUtf8, Utf8, Utf8])` looks inconsistent with `invoke_translate`'s `DataType::LargeUtf8` branch, which downcasts `from`/`to` via `as_string::<i64>()`. If `from`/`to` are arrays (as allowed by the signature), this will likely panic due to an offset-size mismatch (`Utf8`/i32 vs `LargeUtf8`/i64).
Severity: high
Other Locations: datafusion/functions/src/unicode/translate.rs:194, datafusion/functions/src/unicode/translate.rs:195
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/functions/src/unicode/translate.rs (1)
192-197: ⚠️ Potential issue | 🔴 Critical

Bug: `invoke_translate`'s `LargeUtf8` branch casts `from`/`to` arrays with the wrong offset type.

The signature at line 74 declares `Exact(vec![LargeUtf8, Utf8, Utf8])`, where `from` and `to` are `Utf8` (i32 offsets). However, the `invoke_translate` fallback path (lines 192–197) incorrectly casts them as `as_string::<i64>()` when the first argument is `LargeUtf8`. This will panic at runtime when `from`/`to` are non-scalar `Utf8` arrays paired with a `LargeUtf8` first argument.

Proposed fix

```diff
 DataType::LargeUtf8 => {
     let string_array = args[0].as_string::<i64>();
-    let from_array = args[1].as_string::<i64>();
-    let to_array = args[2].as_string::<i64>();
+    let from_array = args[1].as_string::<i32>();
+    let to_array = args[2].as_string::<i32>();
     translate::<i64, _, _>(string_array, from_array, to_array)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions/src/unicode/translate.rs` around lines 192 - 197, The LargeUtf8 branch in invoke_translate casts the `from`/`to` arrays with the wrong offset type (uses as_string::<i64>()); keep the first argument as string_array = args[0].as_string::<i64>(), but cast args[1] and args[2] to as_string::<i32>() and call translate with the correct generic offset types (e.g., translate::<i64, i32, i32>) so the `LargeUtf8` first argument pairs with Utf8 `from`/`to` correctly.
🧹 Nitpick comments (1)
datafusion/functions/src/unicode/translate.rs (1)
324-337: Avoid redundant UTF-8 validation on known-ASCII bytes.

`ascii_buf` is guaranteed to contain only ASCII bytes (each byte comes from the table, which maps ASCII → ASCII or filters via `ASCII_DELETE`). The `from_utf8` validation is unnecessary overhead on a performance-critical hot path. Consider using `from_utf8_unchecked` or, at minimum, a debug-only assertion.

♻️ Proposed change
```diff
-    return String::from_utf8(ascii_buf.clone()).unwrap();
+    // SAFETY: `ascii_buf` only contains bytes produced by the
+    // ASCII translate table, which maps ASCII→ASCII and filters
+    // via ASCII_DELETE; every byte is in 0..=127 and thus valid
+    // UTF-8.
+    debug_assert!(ascii_buf.is_ascii());
+    return unsafe {
+        String::from_utf8_unchecked(ascii_buf.clone())
+    };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@datafusion/functions/src/unicode/translate.rs` around lines 324 - 337, The code currently constructs ascii_buf from ascii_table and then calls String::from_utf8(ascii_buf.clone()).unwrap(), performing redundant UTF-8 validation; change this to construct the String without runtime validation by using String::from_utf8_unchecked(ascii_buf.clone()) inside an unsafe block (or, if you prefer safety during development, add a debug-only assertion like debug_assert!(std::str::from_utf8(&ascii_buf).is_ok()) before the unsafe conversion). Locate the block referencing ascii_table, ASCII_DELETE, ascii_buf, and s.is_ascii() in translate.rs and replace the from_utf8(...).unwrap() call with the unchecked conversion (keeping the comment that ascii_buf contains only ASCII bytes) and ensure the unsafe usage is narrowly scoped.
```rust
/// Optimized translate for constant `from` and `to` arguments: uses a pre-built
/// translation map instead of rebuilding it for every row. When an ASCII byte
/// lookup table is provided, ASCII input rows the lookup table; non-ASCII
/// inputs fallback to using the map.
```
Typo in doc comment.
Line 305: "ASCII input rows the lookup table" → "ASCII input rows use the lookup table".
📝 Fix

```diff
-/// lookup table is provided, ASCII input rows the lookup table; non-ASCII
+/// lookup table is provided, ASCII input rows use the lookup table; non-ASCII
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
/// Optimized translate for constant `from` and `to` arguments: uses a pre-built
/// translation map instead of rebuilding it for every row. When an ASCII byte
/// lookup table is provided, ASCII input rows use the lookup table; non-ASCII
/// inputs fallback to using the map.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@datafusion/functions/src/unicode/translate.rs` around lines 303 - 306, Fix
the typo in the doc comment above the optimized translate implementation in
translate.rs: change the phrase "ASCII input rows the lookup table" to "ASCII
input rows use the lookup table" in the block describing the ASCII byte lookup
table behavior for constant `from`/`to` optimized translate.