20305: perf: Optimize translate() UDF for scalar inputs #245

Open
martin-augment wants to merge 5 commits into main from pr-20305-2026-02-17-13-11-54

Conversation

@martin-augment
Owner

20305: To review by AI

AFAIK there is no reason not to support this.
When the second and third arguments are constants (which is common), we
can build the lookup table once, rather than rebuilding it for every
input row.

When all of the arguments are ASCII-only, we can do lookups via a
fixed-size lookup table that directly maps ASCII byte values, rather
than a hash table.
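
The two optimizations described above can be sketched as follows. The names build_ascii_translate_table and ASCII_DELETE appear in the PR's changelog, but the bodies below are an illustrative reconstruction, not the PR's exact code:

```rust
// Sentinel for "byte appears in `from` but has no counterpart in `to`":
// such characters are deleted, matching PostgreSQL's translate().
// `to` is ASCII-only here, so 0xFF can never collide with a real mapping.
const ASCII_DELETE: u8 = 0xFF;

/// Build a 128-entry table mapping every ASCII byte to its replacement,
/// itself (identity), or ASCII_DELETE. Built once per query, not per row.
fn build_ascii_translate_table(from: &str, to: &str) -> [u8; 128] {
    let mut table = [0u8; 128];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8; // identity by default
    }
    let mut seen = [false; 128];
    let to_bytes = to.as_bytes();
    for (i, &f) in from.as_bytes().iter().enumerate() {
        if seen[f as usize] {
            continue; // duplicate in `from`: first occurrence wins
        }
        seen[f as usize] = true;
        table[f as usize] = *to_bytes.get(i).unwrap_or(&ASCII_DELETE);
    }
    table
}

/// Per-row fast path: a plain byte lookup, no hash table involved.
/// Assumes `s` is ASCII (the real code checks this before calling).
fn ascii_translate(s: &str, table: &[u8; 128]) -> String {
    let mut out = Vec::with_capacity(s.len());
    for &b in s.as_bytes() {
        let mapped = table[b as usize];
        if mapped != ASCII_DELETE {
            out.push(mapped);
        }
    }
    String::from_utf8(out).expect("ASCII bytes are valid UTF-8")
}

fn main() {
    let table = build_ascii_translate_table("143", "ax");
    // '1' -> 'a', '4' -> 'x', '3' has no counterpart and is deleted.
    assert_eq!(ascii_translate("12345", &table), "a2x5");
    println!("ok");
}
```

The table build is O(|from| + |to|) and each row then costs one array index per byte, which is the saving over rebuilding a hash map per row.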
@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the translate() UDF by introducing specialized optimizations for scalar from and to arguments. It leverages pre-computed translation maps and an ASCII-specific byte-level lookup table to speed up string translation operations. The changes also include comprehensive updates to benchmarks and documentation to reflect these improvements and clarify usage.

Highlights

  • Performance Optimization for translate() UDF: Implemented an optimized execution path for the translate() User-Defined Function (UDF) when its from and to arguments are scalar values. This pre-builds the translation map once, avoiding redundant work for each row.
  • ASCII-Specific Optimization: Introduced a highly efficient byte-level lookup table for translate() when both the from and to arguments, as well as the input string, are entirely ASCII. This provides a fast path for common ASCII translation scenarios.
  • Expanded Benchmarking: Updated the translate benchmarks to include test cases for scalar from and to arguments, as well as additional string lengths (128 and 1024), to properly measure the performance gains of the new optimizations.
  • Documentation and Type Support Updates: Revised the documentation for the translate() function to clarify argument names (chars to from, translation to to) and their descriptions. Added support for the LargeUtf8 data type in the translate() function.
Changelog
  • datafusion/functions/benches/translate.rs
    • Renamed create_args function to create_args_array_from_to for clarity.
    • Added a new function create_args_scalar_from_to to generate arguments for scalar from and to inputs.
    • Expanded the benchmark suite to include tests for scalar_from_to scenarios and additional string lengths (128, 1024).
  • datafusion/functions/src/unicode/translate.rs
    • Updated user_doc description and syntax_example for the translate function.
    • Renamed chars argument to from and translation to to in the user_doc attributes.
    • Clarified the descriptions for the from and to arguments in the user_doc.
    • Added support for LargeUtf8 data type to the translate function's signature.
    • Implemented an optimized code path for translate when from and to arguments are scalar values.
    • Introduced a helper function try_as_scalar_str to extract string values from scalar columnar values.
    • Added build_ascii_translate_table function to create a byte-level lookup table for ASCII translations.
    • Implemented translate_with_map function for optimized translation using pre-built maps and an optional ASCII table.
    • Modified the from_map building logic to correctly handle and ignore duplicate characters.
    • Added a new test case to verify the translate_with_map function's grapheme fallback for non-ASCII input with ASCII scalar from/to.
  • docs/source/user-guide/sql/scalar_functions.md
    • Updated the description of the translate function to reflect its character-wise substitution behavior.
    • Renamed the chars argument to from and translation to to in the documentation.
    • Updated the descriptions for the from and to arguments to provide more detailed explanations of their roles.
Activity
  • The pull request was created by martin-augment.
  • The title indicates a performance optimization for the translate() UDF, specifically for scalar inputs.
  • The description explicitly states 'To review by AI', suggesting a request for automated review or assistance.

@coderabbitai

coderabbitai bot commented Feb 17, 2026

Walkthrough

The changes refactor and enhance the translate function implementation across benchmarks, core function logic, and documentation. The benchmark suite expands to test additional string lengths and separates array-based from scalar-based translation paths. The core function adds support for LargeUtf8 input types, implements an optimized fast path for constant ASCII inputs using precomputed lookup tables, and provides a grapheme-based fallback for non-ASCII content. Parameter names are standardized from "chars/translation" to "from/to" across the codebase. Documentation and test coverage are updated accordingly.
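
The precomputed-map fallback for non-ASCII content looks roughly like this. The real implementation walks grapheme clusters via the unicode-segmentation crate; plain chars are used below only to keep the sketch dependency-free:

```rust
use std::collections::HashMap;

/// Map each character of `from` to its index, built once per query.
/// Duplicates keep their first index, matching translate()'s
/// "first occurrence wins" semantics.
fn build_from_map(from: &str) -> HashMap<char, usize> {
    let mut map = HashMap::new();
    for (i, c) in from.chars().enumerate() {
        map.entry(c).or_insert(i);
    }
    map
}

/// Per-row translation using the prebuilt map: characters in `from`
/// are replaced by the same-index character of `to`, or deleted when
/// `to` is shorter; everything else passes through unchanged.
fn translate_with_map(s: &str, from_map: &HashMap<char, usize>, to: &[char]) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match from_map.get(&c) {
            Some(&i) => {
                if let Some(&r) = to.get(i) {
                    out.push(r);
                } // else: index past the end of `to`, character is deleted
            }
            None => out.push(c),
        }
    }
    out
}

fn main() {
    let from_map = build_from_map("cf");
    let to: Vec<char> = "CF".chars().collect();
    assert_eq!(translate_with_map("café", &from_map, &to), "CaFé");
    println!("ok");
}
```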


@claude

claude bot commented Feb 17, 2026

Code Review: Optimize translate() UDF for Scalar Inputs

Good overall approach — pre-building the translation map once for scalar from/to arguments is a sound optimization, and the ASCII fast-path is a nice touch. Here are my findings:


Bugs / Correctness Issues

1. Incorrect null handling when from or to is a null scalar

File: datafusion/functions/src/unicode/translate.rs, invoke_with_args

try_as_scalar_str returns None for both a null scalar and for a non-scalar (array) columnar value. When from is a null scalar, try_as_scalar_str returns None, causing the code to silently fall through to make_scalar_function(invoke_translate, vec![]). This is probably fine in practice (the existing path handles nulls), but the logic conflates two very different cases: "not a scalar" vs. "scalar but null".

The optimized path should only be entered when both from and to are non-null scalars and (importantly) when the first argument is either a column or a non-null scalar. The current structure is fine for the column case, but if all three arguments are null scalars, the fallthrough to invoke_translate still produces the correct answer only by coincidence. Consider documenting this invariant explicitly or restructuring the guard.
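
One way to make the three cases explicit is a small classifier. The types below are minimal stand-ins for DataFusion's ColumnarValue/ScalarValue, sketched only to illustrate the guard, not the PR's actual helper:

```rust
// Stand-ins for DataFusion's types, just enough to show the guard.
enum Scalar {
    Utf8(Option<String>),
}

enum Columnar {
    Scalar(Scalar),
    Array(Vec<Option<String>>),
}

/// Three-way outcome instead of try_as_scalar_str's ambiguous Option.
#[derive(Debug, PartialEq)]
enum ScalarStr<'a> {
    Value(&'a str), // non-null scalar: eligible for the fast path
    Null,           // null scalar: result is known-null, handle explicitly
    NotScalar,      // array argument: must take the per-row path
}

fn classify(v: &Columnar) -> ScalarStr<'_> {
    match v {
        Columnar::Scalar(Scalar::Utf8(Some(s))) => ScalarStr::Value(s),
        Columnar::Scalar(Scalar::Utf8(None)) => ScalarStr::Null,
        Columnar::Array(_) => ScalarStr::NotScalar,
    }
}

fn main() {
    let a = Columnar::Scalar(Scalar::Utf8(Some("abc".to_string())));
    let b = Columnar::Scalar(Scalar::Utf8(None));
    let c = Columnar::Array(vec![Some("x".to_string())]);
    assert_eq!(classify(&a), ScalarStr::Value("abc"));
    assert_eq!(classify(&b), ScalarStr::Null);
    assert_eq!(classify(&c), ScalarStr::NotScalar);
    println!("ok");
}
```

With this shape, the fast-path guard can match on (Value, Value) only, and a Null in either position can short-circuit instead of falling through implicitly.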

2. Unnecessary clone of ascii_buf

File: translate_with_map, line ~336

return String::from_utf8(ascii_buf.clone()).unwrap();

String::from_utf8 takes ownership of a Vec<u8>, so the reused ascii_buf must be cloned, and the clone is then revalidated as UTF-8 on every ASCII row. That validation is pure overhead in the fast path. Use String::from_utf8_unchecked (with a safety comment):

// SAFETY: ascii_buf contains only mapped ASCII bytes, all valid UTF-8.
return unsafe { String::from_utf8_unchecked(ascii_buf.clone()) };

Or, since ascii_buf is reused across iterations and cannot be consumed, copy via the slice API instead:

// SAFETY: all bytes are ASCII, hence valid UTF-8.
return unsafe { std::str::from_utf8_unchecked(&ascii_buf).to_owned() };

Either approach skips the redundant UTF-8 validation.

3. Incorrect comment in translate_with_map doc

File: line ~305

"ASCII input rows the lookup table"

This appears to be a typo — likely "ASCII input uses the lookup table".


Design / Structural Issues

4. LargeUtf8 signature only covers scalar from/to, not arrays

The new type signature Exact(vec![LargeUtf8, Utf8, Utf8]) added in TranslateFunc::new() only allows Utf8 (not LargeUtf8) for the from and to arguments. The existing invoke_translate path (translate::<i64, _>) casts from_array and to_array as as_string::<i64>(), which is consistent with the old behavior. However, the new signature implicitly documents that from/to will always be Utf8 even when str is LargeUtf8. This asymmetry should be mentioned in a comment, or a broader set of signatures should be considered for consistency.

5. Inconsistency: Utf8View not supported in slow-path for LargeUtf8

When from/to are arrays (not scalars), invoke_translate for LargeUtf8 input calls as_string::<i64>() on the from_array and to_array, but those are typed as Utf8 per the signature. The existing code before this PR may have had the same issue, but expanding LargeUtf8 support without auditing the array-path behavior is a risk.


Performance Notes

6. string_graphemes buffer is used but not strictly needed

In translate_with_map's slow path, graphemes are collected into string_graphemes and then iterated. You could iterate s.graphemes(true) directly without the intermediate Vec, saving an allocation per row on the slow path (mirrors how the translate function already does this with the extend + loop pattern, though that pattern also has the extra buffer — worth addressing there too).


Test Coverage Gaps

7. No test for LargeUtf8 input

The new LargeUtf8 signature is untested. A test with ScalarValue::LargeUtf8(Some(...)) should be added.

8. No test for array from/to with LargeUtf8 input

The slow path through invoke_translate for LargeUtf8 is also uncovered.

9. No test for the ASCII fast-path specifically

The new test ("café" with ASCII from/to) exercises the grapheme fallback (because the input is non-ASCII). There is no test that exercises the ASCII table fast-path — e.g., a pure-ASCII input with ASCII from/to scalars. While existing tests like "12345" → "a2x5" happen to go through the fast path, this is worth making explicit.

10. Benchmark size parameter not exercised in the benchmark output label

size appears in the group name but is fixed at 1024 elements inside the loop. The benchmark label format!("array_from_to [str_len={str_len}]") omits size. This is minor but makes Criterion output less clear when sizes change.


Documentation

The rename from chars/translation to from/to and the improved argument descriptions are good improvements and match PostgreSQL's convention. The docs update is consistent.


Summary

| Severity | Issue |
| --- | --- |
| Bug | Redundant clone and UTF-8 validation of ascii_buf in the hot path (#2) |
| Bug | Typo in doc comment: "rows" → "uses" (#3) |
| Design | Null scalar vs. array ambiguity in try_as_scalar_str (#1) |
| Design | LargeUtf8 signature/array-path consistency (#4, #5) |
| Perf | string_graphemes Vec is unnecessary in slow path (#6) |
| Tests | Missing coverage for LargeUtf8, array args, and ASCII fast-path (#7–#9) |

The most important fix before merging is #2 (the clone in the fast path undermines the optimization). The rest are improvements worth addressing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance optimization for the translate() UDF, specifically for cases where the from and to arguments are scalar values. It achieves this by pre-computing the translation map and adding a fast path for ASCII-only translations. The changes are well-structured, including new benchmarks to validate the performance gains and updated documentation.

I've found one area for a minor improvement in the new implementation to make it even more efficient by removing an intermediate allocation. Overall, this is a high-quality contribution.

Comment on lines +340 to +353
string_graphemes.clear();
result_graphemes.clear();

string_graphemes.extend(s.graphemes(true));
for c in &string_graphemes {
    match from_map.get(*c) {
        Some(n) => {
            if let Some(replacement) = to_graphemes.get(*n) {
                result_graphemes.push(*replacement);
            }
        }
        None => result_graphemes.push(*c),
    }
}


Severity: medium

The string_graphemes vector seems unnecessary. You can iterate directly over the graphemes of the input string s to build result_graphemes. This avoids populating an intermediate vector for each row, which should be more efficient.

You would also need to remove the declaration of string_graphemes at line 316.

                result_graphemes.clear();

                for c in s.graphemes(true) {
                    match from_map.get(c) {
                        Some(n) => {
                            if let Some(replacement) = to_graphemes.get(*n) {
                                result_graphemes.push(*replacement);
                            }
                        }
                        None => result_graphemes.push(c),
                    }
                }

@augmentcode

augmentcode bot commented Feb 17, 2026

🤖 Augment PR Summary

Summary: This PR optimizes the translate() scalar UDF when the from and to arguments are constants.

Changes:

  • Adds a fast path that pre-builds the from→index map once per invocation when from/to are scalar values
  • Introduces an optional fixed-size ASCII lookup table for byte-level translation when both mapping strings are ASCII
  • Extends the UDF signature to accept LargeUtf8 input for str
  • Adds benchmark cases comparing array vs scalar from/to inputs over multiple string lengths
  • Updates user-facing documentation to use from/to terminology and clarify deletion semantics when from is longer than to
  • Adds a unit test to cover the non-ASCII input / ASCII-mapping fallback path

Technical Notes: The optimized path reuses the existing grapheme-based behavior for non-ASCII inputs, while accelerating ASCII-only cases via direct byte substitution.



@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.


vec![
Exact(vec![Utf8View, Utf8, Utf8]),
Exact(vec![Utf8, Utf8, Utf8]),
Exact(vec![LargeUtf8, Utf8, Utf8]),

The new signature arm Exact(vec![LargeUtf8, Utf8, Utf8]) looks inconsistent with invoke_translate’s DataType::LargeUtf8 branch, which downcasts from/to via as_string::<i64>(). If from/to are arrays (as allowed by the signature), this will likely panic due to offset-size mismatch (Utf8/i32 vs LargeUtf8/i64).

Severity: high

Other Locations
  • datafusion/functions/src/unicode/translate.rs:194
  • datafusion/functions/src/unicode/translate.rs:195



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/functions/src/unicode/translate.rs (1)

192-197: ⚠️ Potential issue | 🔴 Critical

Bug: invoke_translate LargeUtf8 branch casts from/to arrays with wrong offset type.

The signature at line 74 declares Exact(vec![LargeUtf8, Utf8, Utf8]), where from and to are Utf8 (i32 offsets). However, the invoke_translate fallback path (lines 192–197) incorrectly casts them as as_string::<i64>() when the first argument is LargeUtf8. This will panic at runtime when from/to are non-scalar Utf8 arrays paired with a LargeUtf8 first argument.

Proposed fix
         DataType::LargeUtf8 => {
             let string_array = args[0].as_string::<i64>();
-            let from_array = args[1].as_string::<i64>();
-            let to_array = args[2].as_string::<i64>();
+            let from_array = args[1].as_string::<i32>();
+            let to_array = args[2].as_string::<i32>();
             translate::<i64, _, _>(string_array, from_array, to_array)
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/unicode/translate.rs` around lines 192 - 197, The
LargeUtf8 branch in invoke_translate casts the `from`/`to` arrays with the wrong
offset type (uses as_string::<i64>()); keep the first argument as string_array =
args[0].as_string::<i64>(), but cast args[1] and args[2] to as_string::<i32>()
and call translate with the correct generic offset types (e.g., translate::<i64,
i32, i32>) so the `LargeUtf8` first argument pairs with Utf8 `from`/`to`
correctly.
🧹 Nitpick comments (1)
datafusion/functions/src/unicode/translate.rs (1)

324-337: Avoid redundant UTF-8 validation on known-ASCII bytes.

ascii_buf is guaranteed to contain only ASCII bytes (each byte comes from the table which maps ASCII → ASCII or filters via ASCII_DELETE). The from_utf8 validation is unnecessary overhead on a performance-critical hot path. Consider using from_utf8_unchecked or at minimum a debug-only assertion.

♻️ Proposed change
-                    return String::from_utf8(ascii_buf.clone()).unwrap();
+                    // SAFETY: `ascii_buf` only contains bytes produced by the
+                    // ASCII translate table, which maps ASCII→ASCII and filters
+                    // via ASCII_DELETE; every byte is in 0..=127 and thus valid
+                    // UTF-8.
+                    debug_assert!(ascii_buf.is_ascii());
+                    return unsafe {
+                        String::from_utf8_unchecked(ascii_buf.clone())
+                    };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/unicode/translate.rs` around lines 324 - 337, The
code currently constructs ascii_buf from ascii_table and then calls
String::from_utf8(ascii_buf.clone()).unwrap(), performing redundant UTF-8
validation; change this to construct the String without runtime validation by
using String::from_utf8_unchecked(ascii_buf.clone()) inside an unsafe block (or,
if you prefer safety during development, add a debug-only assertion like
debug_assert!(std::str::from_utf8(&ascii_buf).is_ok()) before the unsafe
conversion). Locate the block referencing ascii_table, ASCII_DELETE, ascii_buf,
and s.is_ascii() in translate.rs and replace the from_utf8(...).unwrap() call
with the unchecked conversion (keeping the comment that ascii_buf contains only
ASCII bytes) and ensure the unsafe usage is narrowly scoped.

Comment on lines +303 to +306
/// Optimized translate for constant `from` and `to` arguments: uses a pre-built
/// translation map instead of rebuilding it for every row. When an ASCII byte
/// lookup table is provided, ASCII input rows the lookup table; non-ASCII
/// inputs fallback to using the map.

⚠️ Potential issue | 🟡 Minor

Typo in doc comment.

Line 305: "ASCII input rows the lookup table" → "ASCII input rows use the lookup table".

📝 Fix
-/// lookup table is provided, ASCII input rows the lookup table; non-ASCII
+/// lookup table is provided, ASCII input rows use the lookup table; non-ASCII
📝 Committable suggestion


Suggested change
/// Optimized translate for constant `from` and `to` arguments: uses a pre-built
/// translation map instead of rebuilding it for every row. When an ASCII byte
/// lookup table is provided, ASCII input rows the lookup table; non-ASCII
/// inputs fallback to using the map.
/// Optimized translate for constant `from` and `to` arguments: uses a pre-built
/// translation map instead of rebuilding it for every row. When an ASCII byte
/// lookup table is provided, ASCII input rows use the lookup table; non-ASCII
/// inputs fallback to using the map.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/functions/src/unicode/translate.rs` around lines 303 - 306, Fix
the typo in the doc comment above the optimized translate implementation in
translate.rs: change the phrase "ASCII input rows the lookup table" to "ASCII
input rows use the lookup table" in the block describing the ASCII byte lookup
table behavior for constant `from`/`to` optimized translate.
