
19832: perf: optimize spark_hex dictionary path by avoiding dictionary expansion #189

Open
martin-augment wants to merge 2 commits into main from pr-19832-2026-01-16-10-02-26

Conversation

@martin-augment
Owner

19832: To review by AI

@coderabbitai

coderabbitai bot commented Jan 16, 2026

Walkthrough

The SparkHex UDF is modified to support dictionary-encoded inputs. The return type logic is updated to conditionally return a Dictionary type when the input argument is dictionary-encoded, otherwise returning Utf8. The compute implementation now handles dictionary inputs by extracting values, encoding them, and reconstructing the dictionary with the encoded results. Existing test cases are updated to use dictionary builders and assertions. A new test case is added to verify the function's behavior with dictionary-encoded string inputs containing multiple values and null entries.
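The dictionary-path idea the walkthrough describes can be illustrated with a small std-only toy model (a sketch; `DictArray` and `hex_dict` are hypothetical names, not the arrow-rs API). A dictionary array stores each distinct value once plus a per-row key, so hex-encoding only the values vector does O(distinct values) work instead of O(rows):

```rust
// Toy dictionary array: one key per row pointing into a deduplicated
// values vector; None represents a null row.
struct DictArray {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

// Uppercase hex of a string's bytes (Spark-style).
fn hex_encode(s: &str) -> String {
    s.bytes().map(|b| format!("{b:02X}")).collect()
}

// Encode only the distinct values and reuse the keys unchanged,
// which is the shape of the optimization in this PR.
fn hex_dict(dict: &DictArray) -> DictArray {
    DictArray {
        keys: dict.keys.clone(),
        values: dict.values.iter().map(|v| hex_encode(v)).collect(),
    }
}
```

With rows ["foo", "bar", "foo", NULL] the values vector holds only two strings, so only two encodings run regardless of row count.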


@claude

claude bot commented Jan 16, 2026

Code Review: Optimize spark_hex Dictionary Path

Thank you for this optimization! This PR improves performance by avoiding dictionary expansion when processing dictionary-encoded columns. Here's my detailed review:


Strengths

  1. Excellent Performance Optimization: Avoiding dictionary expansion is the right approach - this preserves the dictionary encoding benefits (reduced memory usage and faster processing).

  2. Proper Return Type Handling: The return_type method correctly preserves dictionary types (lines 95-102), matching the pattern used in other functions like space.

  3. Comprehensive Test Coverage: All three dictionary value types (Utf8, Int64, Binary) are tested, including null handling in test_dict_values_null.

  4. Consistent Code Style: The implementation follows DataFusion conventions and matches similar patterns in the codebase.

  5. Good Integration Test: The new sqllogictest verifies end-to-end functionality with dictionary columns.
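As context for the strengths above, the byte-level transformation behind the `hex_encode_bytes` helper can be sketched as follows (the helper's purpose and `lowercase` flag come from the PR diff; this body is an assumption, not the DataFusion source):

```rust
// Minimal sketch of Spark-style hex over raw bytes: each byte becomes two
// hex digits; the `lowercase` flag selects the digit case.
fn hex_bytes(input: &[u8], lowercase: bool) -> String {
    input
        .iter()
        .map(|b| if lowercase { format!("{b:02x}") } else { format!("{b:02X}") })
        .collect()
}
```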


🔍 Issues & Concerns

1. Potential Bug: Hard-coded Int32Type assumption (HIGH PRIORITY)

Location: datafusion/spark/src/function/math/hex.rs:247

The code assumes all dictionaries use Int32Type keys:

DataType::Dictionary(_, _) => {
    let dict = as_dictionary_array::<Int32Type>(&array);

Problem: This will panic/fail at runtime if a dictionary with a different key type (Int8, Int16, Int64, UInt32) is passed, even though the return type correctly preserves the key type.

Test Gap: Add tests for dictionaries with different key types.
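The fix direction here can be shown with a std-only sketch: make the per-key-type work generic and select the instantiation from the runtime key type. `KeyType`, `remap_keys`, and `dispatch` are hypothetical names; the real fix would dispatch to arrow's `Int8Type`/`Int16Type`/`Int32Type`/`Int64Type` via `as_dictionary_array`:

```rust
// Runtime tag standing in for the dictionary's key DataType.
#[derive(Clone, Copy)]
enum KeyType {
    Int8,
    Int16,
    Int32,
    Int64,
}

// Generic over any integer key width: the work is written once.
fn remap_keys<K: Copy + TryInto<usize>>(keys: &[K]) -> Vec<usize> {
    keys.iter()
        .map(|k| (*k).try_into().ok().expect("dictionary key must be non-negative"))
        .collect()
}

// Match on the runtime key type instead of hardcoding one width,
// so no input key type reaches an unchecked downcast.
fn dispatch(key_type: KeyType, raw: &[i64]) -> Vec<usize> {
    match key_type {
        KeyType::Int8 => remap_keys(&raw.iter().map(|&k| k as i8).collect::<Vec<_>>()),
        KeyType::Int16 => remap_keys(&raw.iter().map(|&k| k as i16).collect::<Vec<_>>()),
        KeyType::Int32 => remap_keys(&raw.iter().map(|&k| k as i32).collect::<Vec<_>>()),
        KeyType::Int64 => remap_keys(raw),
    }
}
```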


2. Missing dictionary type coverage

Location: datafusion/spark/src/function/math/hex.rs:250-269

The dictionary branch only handles Int64, Utf8, and Binary, but the non-dictionary code also handles Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary.

Impact: Dictionary-encoded columns with these value types will fail with an error.

Recommendation: Add support for all types in the dictionary branch OR document why these are intentionally unsupported.


🧪 Test Coverage Assessment

Good: Dictionary with string/int64/binary values, nulls, integration test

Missing: Dictionary with different key types, unsupported value types, empty dictionary, all null values


Performance Considerations

Excellent: This optimization should provide significant benefits by avoiding dictionary expansion and preserving memory efficiency.


🎯 Recommendations Priority

  1. MUST FIX: Handle non-Int32 dictionary key types (Issue 1)
  2. SHOULD FIX: Add missing dictionary type support or document limitations (Issue 2)
  3. NICE TO HAVE: Add comprehensive edge case tests

Overall Assessment

This is a solid performance optimization with the right architectural approach. The main blocker is the hard-coded Int32Type assumption which could cause runtime failures. Once that's addressed, this will be a valuable improvement.

Recommended Action: Request changes to fix the key type handling before merging.

Great work on identifying and implementing this optimization! 🚀

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@datafusion/spark/src/function/math/hex.rs`:
- Around line 246-278: The DataType::Dictionary branch currently hardcodes
as_dictionary_array::<Int32Type>(&array) which will fail for dictionaries with
other integer key types; update the match to either (preferred) pattern-match
the dictionary key type and call as_dictionary_array with the matching integer
type (e.g., as_dictionary_array::<Int8Type>, ::<Int16Type>, ::<Int32Type>,
::<Int64Type>) and then proceed with encoding and dict.with_values, or
(alternative) enforce/support-only-Int32 by adding a validation in return_type
that rejects non-Int32 dictionary key types so this branch cannot be reached for
other key types; reference the DataType::Dictionary match arm and the
as_dictionary_array::<Int32Type> call when making the change.
🧹 Nitpick comments (1)
datafusion/spark/src/function/math/hex.rs (1)

250-268: Consider supporting additional dictionary value types for consistency.

The non-dictionary path supports Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary, but the dictionary path only handles Int64, Utf8, and Binary. Consider adding support for the missing types for consistency, or document why they're intentionally excluded.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1262c68 and 8b5087e.

📒 Files selected for processing (2)
  • datafusion/spark/src/function/math/hex.rs
  • datafusion/sqllogictest/test_files/spark/math/hex.slt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (3)
datafusion/spark/src/function/math/hex.rs (2)

95-102: LGTM on return type logic.

The return type correctly propagates the dictionary key type while replacing the value type with Utf8 for hex-encoded output. This maintains dictionary encoding through the transformation.


302-328: Good test coverage for dictionary paths.

The tests appropriately cover:

  • Dictionary with Utf8 values
  • Dictionary with Int64 values
  • Dictionary with Binary values
  • Null handling in dictionary values

Note: All tests use Int32Type keys, which aligns with the current implementation limitation mentioned above.

Also applies to: 331-357, 360-386, 426-447

datafusion/sqllogictest/test_files/spark/math/hex.slt (1)

67-80: Good integration test for dictionary-encoded input.

The test effectively validates:

  • Dictionary encoding with Int32 keys (matches the implementation)
  • Duplicate values ('foo' and 'bar' appear twice) to verify dictionary structure is preserved
  • NULL handling within the dictionary
  • Correct hex encoding of string values


Comment on lines +246 to 278

             DataType::Dictionary(_, _) => {
                 let dict = as_dictionary_array::<Int32Type>(&array);
+                let dict_values = dict.values();

-                match **value_type {
+                let encoded_values: ColumnarValue = match dict_values.data_type() {
                     DataType::Int64 => {
-                        let arr = dict.downcast_dict::<Int64Array>().unwrap();
-                        hex_encode_int64(arr.into_iter(), dict.len())
+                        let arr = as_int64_array(dict_values)?;
+                        hex_encode_int64(arr.iter(), arr.len())?
                     }
                     DataType::Utf8 => {
-                        let arr = dict.downcast_dict::<StringArray>().unwrap();
-                        hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+                        let arr = as_string_array(dict_values);
+                        hex_encode_bytes(arr.iter(), lowercase, arr.len())?
                     }
                     DataType::Binary => {
-                        let arr = dict.downcast_dict::<BinaryArray>().unwrap();
-                        hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+                        let arr = as_binary_array(dict_values)?;
+                        hex_encode_bytes(arr.iter(), lowercase, arr.len())?
                     }
                     _ => {
-                        exec_err!(
+                        return exec_err!(
                             "hex got an unexpected argument type: {}",
-                            array.data_type()
-                        )
+                            dict_values.data_type()
+                        );
                     }
-                }
+                };
+
+                let encoded_values_array: ArrayRef = match encoded_values {
+                    ColumnarValue::Array(a) => a,
+                    ColumnarValue::Scalar(s) => Arc::new(s.to_array()?),
+                };
+
+                let new_dict = dict.with_values(encoded_values_array);
+                Ok(ColumnarValue::Array(Arc::new(new_dict)))
             }
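On the `hex_encode_int64` path referenced here: Spark formats a bigint's two's-complement bit pattern as uppercase hex, so negative inputs come out as the full 16 digits. A one-line sketch of that assumed behavior (`hex_int64` is a hypothetical name, not the DataFusion helper):

```rust
// Spark-style hex for a 64-bit integer: reinterpret the two's-complement
// bits as unsigned and format them as uppercase hex.
fn hex_int64(v: i64) -> String {
    format!("{:X}", v as u64)
}
```

For example, `hex_int64(255)` yields "FF" and `hex_int64(-1)` yields "FFFFFFFFFFFFFFFF".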

⚠️ Potential issue | 🟠 Major

Dictionary key type is hardcoded to Int32Type, but return_type accepts any key type.

The return_type method at line 97-99 accepts dictionaries with any key type and returns a dictionary preserving that key type. However, this compute path hardcodes Int32Type at line 247. If a dictionary with Int8Type, Int16Type, or Int64Type keys is passed, this will panic or produce incorrect results at runtime.

🔧 Suggested fix: Match on key type or add validation

Option 1: Handle multiple key types explicitly:

-            DataType::Dictionary(_, _) => {
-                let dict = as_dictionary_array::<Int32Type>(&array);
+            DataType::Dictionary(key_type, _) => {
+                macro_rules! handle_dict_key_type {
+                    ($key_type:ty) => {{
+                        let dict = as_dictionary_array::<$key_type>(&array);
+                        // ... rest of handling
+                    }};
+                }
+                match key_type.as_ref() {
+                    DataType::Int8 => handle_dict_key_type!(Int8Type),
+                    DataType::Int16 => handle_dict_key_type!(Int16Type),
+                    DataType::Int32 => handle_dict_key_type!(Int32Type),
+                    DataType::Int64 => handle_dict_key_type!(Int64Type),
+                    // ... etc
+                }

Option 2: If only Int32Type is intended to be supported, add validation in return_type:

 fn return_type(&self, arg_types: &[DataType]) -> datafusion_common::Result<DataType> {
     Ok(match &arg_types[0] {
-        DataType::Dictionary(key_type, _) => {
+        DataType::Dictionary(key_type, _) if **key_type == DataType::Int32 => {
             DataType::Dictionary(key_type.clone(), Box::new(DataType::Utf8))
         }
         _ => DataType::Utf8,
     })
 }

Owner Author

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@augmentcode

augmentcode bot commented Jan 16, 2026

🤖 Augment PR Summary

Summary: Improves performance of Spark-compatible hex by avoiding expansion of dictionary-encoded inputs.

Changes:

  • Updates SparkHex::return_type to return Dictionary(K, Utf8) when the argument is dictionary-typed, preserving the key type.
  • Refactors dictionary handling in compute_hex to hex-encode only dict.values() and then rebuild the dictionary with the original keys.
  • Updates Rust unit tests and adds a sqllogictest case to cover hex(Dictionary(Int32, Utf8)).


@augmentcode bot left a comment

Review completed. 1 suggestion posted.


                 hex_encode_bytes(array.iter(), lowercase, array.len())
             }
-            DataType::Dictionary(_, value_type) => {
+            DataType::Dictionary(_, _) => {

This dictionary path downcasts via as_dictionary_array::<Int32Type> (and return_type now preserves the input key type), so a non-Int32 dictionary input could panic at runtime. Consider validating/rejecting non-Int32 dictionary keys or supporting dictionary keys generically here.


Owner Author

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@martin-augment
Owner Author

1. Potential Bug: Hard-coded Int32Type assumption (HIGH PRIORITY)

Location: datafusion/spark/src/function/math/hex.rs:247

The code assumes all dictionaries use Int32Type keys:

DataType::Dictionary(_, _) => {
    let dict = as_dictionary_array::<Int32Type>(&array);

Problem: This will panic/fail at runtime if a dictionary with a different key type (Int8, Int16, Int64, UInt32) is passed, even though the return type correctly preserves the key type.

Test Gap: Add tests for dictionaries with different key types.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@martin-augment
Owner Author

250-268: Consider supporting additional dictionary value types for consistency.

The non-dictionary path supports Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary, but the dictionary path only handles Int64, Utf8, and Binary. Consider adding support for the missing types for consistency, or document why they're intentionally excluded.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished feature.

@martin-augment
Owner Author

2. Missing dictionary type coverage

Location: datafusion/spark/src/function/math/hex.rs:250-269

The dictionary branch only handles Int64, Utf8, and Binary, but the non-dictionary code also handles Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary.

Impact: Dictionary-encoded columns with these value types will fail with an error.

Recommendation: Add support for all types in the dictionary branch OR document why these are intentionally unsupported.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished feature.

