
19832: perf: optimize spark_hex dictionary path by avoiding dictionary expansion #189

Open
martin-augment wants to merge 2 commits into main from pr-19832-2026-01-16-10-02-26

Conversation

@martin-augment
Owner

19832: To review by AI

@coderabbitai

coderabbitai bot commented Jan 16, 2026

Walkthrough

The SparkHex UDF is modified to support dictionary-encoded inputs. The return type logic is updated to conditionally return a Dictionary type when the input argument is dictionary-encoded, otherwise returning Utf8. The compute implementation now handles dictionary inputs by extracting values, encoding them, and reconstructing the dictionary with the encoded results. Existing test cases are updated to use dictionary builders and assertions. A new test case is added to verify the function's behavior with dictionary-encoded string inputs containing multiple values and null entries.
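The dictionary-path idea the walkthrough describes can be illustrated with a small std-only toy model (a sketch; `DictArray` and `hex_dict` are hypothetical names, not the arrow-rs API). A dictionary array stores each distinct value once plus a per-row key, so hex-encoding only the values vector does O(distinct values) work instead of O(rows):

```rust
// Toy dictionary array: one key per row pointing into a deduplicated
// values vector; None represents a null row.
struct DictArray {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

// Uppercase hex of a string's bytes (Spark-style).
fn hex_encode(s: &str) -> String {
    s.bytes().map(|b| format!("{b:02X}")).collect()
}

// Encode only the distinct values and reuse the keys unchanged,
// which is the shape of the optimization in this PR.
fn hex_dict(dict: &DictArray) -> DictArray {
    DictArray {
        keys: dict.keys.clone(),
        values: dict.values.iter().map(|v| hex_encode(v)).collect(),
    }
}
```

With rows ["foo", "bar", "foo", NULL] the values vector holds only two strings, so only two encodings run regardless of row count.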


@claude

claude bot commented Jan 16, 2026

Code Review: Optimize spark_hex Dictionary Path

Thank you for this optimization! This PR improves performance by avoiding dictionary expansion when processing dictionary-encoded columns. Here's my detailed review:


Strengths

  1. Excellent Performance Optimization: Avoiding dictionary expansion is the right approach - this preserves the dictionary encoding benefits (reduced memory usage and faster processing).

  2. Proper Return Type Handling: The return_type method correctly preserves dictionary types (lines 95-102), matching the pattern used in other functions like space.

  3. Comprehensive Test Coverage: All three dictionary value types (Utf8, Int64, Binary) are tested, including null handling in test_dict_values_null.

  4. Consistent Code Style: The implementation follows DataFusion conventions and matches similar patterns in the codebase.

  5. Good Integration Test: The new sqllogictest verifies end-to-end functionality with dictionary columns.
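As context for the strengths above, the byte-level transformation behind the `hex_encode_bytes` helper can be sketched as follows (the helper's purpose and `lowercase` flag come from the PR diff; this body is an assumption, not the DataFusion source):

```rust
// Minimal sketch of Spark-style hex over raw bytes: each byte becomes two
// hex digits; the `lowercase` flag selects the digit case.
fn hex_bytes(input: &[u8], lowercase: bool) -> String {
    input
        .iter()
        .map(|b| if lowercase { format!("{b:02x}") } else { format!("{b:02X}") })
        .collect()
}
```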


🔍 Issues & Concerns

1. Potential Bug: Hard-coded Int32Type assumption (HIGH PRIORITY)

Location: datafusion/spark/src/function/math/hex.rs:247

The code assumes all dictionaries use Int32Type keys:

DataType::Dictionary(_, _) => {
    let dict = as_dictionary_array::<Int32Type>(&array);

Problem: This will panic/fail at runtime if a dictionary with a different key type (Int8, Int16, Int64, UInt32) is passed, even though the return type correctly preserves the key type.

Test Gap: Add tests for dictionaries with different key types.
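The fix direction here can be shown with a std-only sketch: make the per-key-type work generic and select the instantiation from the runtime key type. `KeyType`, `remap_keys`, and `dispatch` are hypothetical names; the real fix would dispatch to arrow's `Int8Type`/`Int16Type`/`Int32Type`/`Int64Type` via `as_dictionary_array`:

```rust
// Runtime tag standing in for the dictionary's key DataType.
#[derive(Clone, Copy)]
enum KeyType {
    Int8,
    Int16,
    Int32,
    Int64,
}

// Generic over any integer key width: the work is written once.
fn remap_keys<K: Copy + TryInto<usize>>(keys: &[K]) -> Vec<usize> {
    keys.iter()
        .map(|k| (*k).try_into().ok().expect("dictionary key must be non-negative"))
        .collect()
}

// Match on the runtime key type instead of hardcoding one width,
// so no input key type reaches an unchecked downcast.
fn dispatch(key_type: KeyType, raw: &[i64]) -> Vec<usize> {
    match key_type {
        KeyType::Int8 => remap_keys(&raw.iter().map(|&k| k as i8).collect::<Vec<_>>()),
        KeyType::Int16 => remap_keys(&raw.iter().map(|&k| k as i16).collect::<Vec<_>>()),
        KeyType::Int32 => remap_keys(&raw.iter().map(|&k| k as i32).collect::<Vec<_>>()),
        KeyType::Int64 => remap_keys(raw),
    }
}
```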


2. Missing dictionary type coverage

Location: datafusion/spark/src/function/math/hex.rs:250-269

The dictionary branch only handles Int64, Utf8, and Binary, but the non-dictionary code also handles Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary.

Impact: Dictionary-encoded columns with these value types will fail with an error.

Recommendation: Add support for all types in the dictionary branch OR document why these are intentionally unsupported.


🧪 Test Coverage Assessment

Good: Dictionary with string/int64/binary values, nulls, integration test

Missing: Dictionary with different key types, unsupported value types, empty dictionary, all null values


Performance Considerations

Excellent: This optimization should provide significant benefits by avoiding dictionary expansion and preserving memory efficiency.


🎯 Recommendations Priority

  1. MUST FIX: Handle non-Int32 dictionary key types (Issue 1)
  2. SHOULD FIX: Add missing dictionary type support or document limitations (Issue 2)
  3. NICE TO HAVE: Add comprehensive edge case tests

Overall Assessment

This is a solid performance optimization with the right architectural approach. The main blocker is the hard-coded Int32Type assumption which could cause runtime failures. Once that's addressed, this will be a valuable improvement.

Recommended Action: Request changes to fix the key type handling before merging.

Great work on identifying and implementing this optimization! 🚀

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@datafusion/spark/src/function/math/hex.rs`:
- Around line 246-278: The DataType::Dictionary branch currently hardcodes
as_dictionary_array::<Int32Type>(&array) which will fail for dictionaries with
other integer key types; update the match to either (preferred) pattern-match
the dictionary key type and call as_dictionary_array with the matching integer
type (e.g., as_dictionary_array::<Int8Type>, ::<Int16Type>, ::<Int32Type>,
::<Int64Type>) and then proceed with encoding and dict.with_values, or
(alternative) enforce/support-only-Int32 by adding a validation in return_type
that rejects non-Int32 dictionary key types so this branch cannot be reached for
other key types; reference the DataType::Dictionary match arm and the
as_dictionary_array::<Int32Type> call when making the change.
🧹 Nitpick comments (1)
datafusion/spark/src/function/math/hex.rs (1)

250-268: Consider supporting additional dictionary value types for consistency.

The non-dictionary path supports Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary, but the dictionary path only handles Int64, Utf8, and Binary. Consider adding support for the missing types for consistency, or document why they're intentionally excluded.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1262c68 and 8b5087e.

📒 Files selected for processing (2)
  • datafusion/spark/src/function/math/hex.rs
  • datafusion/sqllogictest/test_files/spark/math/hex.slt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (3)
datafusion/spark/src/function/math/hex.rs (2)

95-102: LGTM on return type logic.

The return type correctly propagates the dictionary key type while replacing the value type with Utf8 for hex-encoded output. This maintains dictionary encoding through the transformation.


302-328: Good test coverage for dictionary paths.

The tests appropriately cover:

  • Dictionary with Utf8 values
  • Dictionary with Int64 values
  • Dictionary with Binary values
  • Null handling in dictionary values

Note: All tests use Int32Type keys, which aligns with the current implementation limitation mentioned above.

Also applies to: 331-357, 360-386, 426-447

datafusion/sqllogictest/test_files/spark/math/hex.slt (1)

67-80: Good integration test for dictionary-encoded input.

The test effectively validates:

  • Dictionary encoding with Int32 keys (matches the implementation)
  • Duplicate values ('foo' and 'bar' appear twice) to verify dictionary structure is preserved
  • NULL handling within the dictionary
  • Correct hex encoding of string values


Comment on lines +246 to 278

             DataType::Dictionary(_, _) => {
                 let dict = as_dictionary_array::<Int32Type>(&array);
+                let dict_values = dict.values();

-                match **value_type {
+                let encoded_values: ColumnarValue = match dict_values.data_type() {
                     DataType::Int64 => {
-                        let arr = dict.downcast_dict::<Int64Array>().unwrap();
-                        hex_encode_int64(arr.into_iter(), dict.len())
+                        let arr = as_int64_array(dict_values)?;
+                        hex_encode_int64(arr.iter(), arr.len())?
                     }
                     DataType::Utf8 => {
-                        let arr = dict.downcast_dict::<StringArray>().unwrap();
-                        hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+                        let arr = as_string_array(dict_values);
+                        hex_encode_bytes(arr.iter(), lowercase, arr.len())?
                     }
                     DataType::Binary => {
-                        let arr = dict.downcast_dict::<BinaryArray>().unwrap();
-                        hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+                        let arr = as_binary_array(dict_values)?;
+                        hex_encode_bytes(arr.iter(), lowercase, arr.len())?
                     }
                     _ => {
-                        exec_err!(
+                        return exec_err!(
                             "hex got an unexpected argument type: {}",
-                            array.data_type()
-                        )
+                            dict_values.data_type()
+                        );
                     }
-                }
+                };
+
+                let encoded_values_array: ArrayRef = match encoded_values {
+                    ColumnarValue::Array(a) => a,
+                    ColumnarValue::Scalar(s) => Arc::new(s.to_array()?),
+                };
+
+                let new_dict = dict.with_values(encoded_values_array);
+                Ok(ColumnarValue::Array(Arc::new(new_dict)))
             }
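On the `hex_encode_int64` path referenced here: Spark formats a bigint's two's-complement bit pattern as uppercase hex, so negative inputs come out as the full 16 digits. A one-line sketch of that assumed behavior (`hex_int64` is a hypothetical name, not the DataFusion helper):

```rust
// Spark-style hex for a 64-bit integer: reinterpret the two's-complement
// bits as unsigned and format them as uppercase hex.
fn hex_int64(v: i64) -> String {
    format!("{:X}", v as u64)
}
```

For example, `hex_int64(255)` yields "FF" and `hex_int64(-1)` yields "FFFFFFFFFFFFFFFF".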

⚠️ Potential issue | 🟠 Major

Dictionary key type is hardcoded to Int32Type, but return_type accepts any key type.

The return_type method at line 97-99 accepts dictionaries with any key type and returns a dictionary preserving that key type. However, this compute path hardcodes Int32Type at line 247. If a dictionary with Int8Type, Int16Type, or Int64Type keys is passed, this will panic or produce incorrect results at runtime.

🔧 Suggested fix: Match on key type or add validation

Option 1: Handle multiple key types explicitly:

-            DataType::Dictionary(_, _) => {
-                let dict = as_dictionary_array::<Int32Type>(&array);
+            DataType::Dictionary(key_type, _) => {
+                macro_rules! handle_dict_key_type {
+                    ($key_type:ty) => {{
+                        let dict = as_dictionary_array::<$key_type>(&array);
+                        // ... rest of handling
+                    }};
+                }
+                match key_type.as_ref() {
+                    DataType::Int8 => handle_dict_key_type!(Int8Type),
+                    DataType::Int16 => handle_dict_key_type!(Int16Type),
+                    DataType::Int32 => handle_dict_key_type!(Int32Type),
+                    DataType::Int64 => handle_dict_key_type!(Int64Type),
+                    // ... etc
+                }

Option 2: If only Int32Type is intended to be supported, add validation in return_type:

 fn return_type(&self, arg_types: &[DataType]) -> datafusion_common::Result<DataType> {
     Ok(match &arg_types[0] {
-        DataType::Dictionary(key_type, _) => {
+        DataType::Dictionary(key_type, _) if **key_type == DataType::Int32 => {
             DataType::Dictionary(key_type.clone(), Box::new(DataType::Utf8))
         }
         _ => DataType::Utf8,
     })
 }

Owner Author

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@augmentcode

augmentcode bot commented Jan 16, 2026

🤖 Augment PR Summary

Summary: Improves performance of Spark-compatible hex by avoiding expansion of dictionary-encoded inputs.

Changes:

  • Updates SparkHex::return_type to return Dictionary(K, Utf8) when the argument is dictionary-typed, preserving the key type.
  • Refactors dictionary handling in compute_hex to hex-encode only dict.values() and then rebuild the dictionary with the original keys.
  • Updates Rust unit tests and adds a sqllogictest case to cover hex(Dictionary(Int32, Utf8)).


@augmentcode bot left a comment

Review completed. 1 suggestion posted.


                 hex_encode_bytes(array.iter(), lowercase, array.len())
             }
-            DataType::Dictionary(_, value_type) => {
+            DataType::Dictionary(_, _) => {

This dictionary path downcasts via as_dictionary_array::<Int32Type> (and return_type now preserves the input key type), so a non-Int32 dictionary input could panic at runtime. Consider validating/rejecting non-Int32 dictionary keys or supporting dictionary keys generically here.


Owner Author

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@martin-augment
Owner Author

1. Potential Bug: Hard-coded Int32Type assumption (HIGH PRIORITY)

Location: datafusion/spark/src/function/math/hex.rs:247

The code assumes all dictionaries use Int32Type keys:

DataType::Dictionary(_, _) => {
    let dict = as_dictionary_array::<Int32Type>(&array);

Problem: This will panic/fail at runtime if a dictionary with a different key type (Int8, Int16, Int64, UInt32) is passed, even though the return type correctly preserves the key type.

Test Gap: Add tests for dictionaries with different key types.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The proposed changes try to cast to Int32Array, but the user-defined function signature allows any argument that can be cast to Int64, i.e. Int8, Int16, Int32 and Int64, so the cast here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked cast.

@martin-augment
Owner Author

250-268: Consider supporting additional dictionary value types for consistency.

The non-dictionary path supports Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary, but the dictionary path only handles Int64, Utf8, and Binary. Consider adding support for the missing types for consistency, or document why they're intentionally excluded.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished feature.

@martin-augment
Owner Author

2. Missing dictionary type coverage

Location: datafusion/spark/src/function/math/hex.rs:250-269

The dictionary branch only handles Int64, Utf8, and Binary, but the non-dictionary code also handles Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary.

Impact: Dictionary-encoded columns with these value types will fail with an error.

Recommendation: Add support for all types in the dictionary branch OR document why these are intentionally unsupported.

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished feature.

