perf: optimize spark_hex dictionary path by avoiding dictionary expansion #189
martin-augment wants to merge 2 commits into main
Walkthrough

The SparkHex UDF is modified to support dictionary-encoded inputs. The return type logic is updated to conditionally return a Dictionary type when the input argument is dictionary-encoded, otherwise returning Utf8. The compute implementation now handles dictionary inputs by extracting values, encoding them, and reconstructing the dictionary with the encoded results. Existing test cases are updated to use dictionary builders and assertions. A new test case is added to verify the function's behavior with dictionary-encoded string inputs containing multiple values and null entries.
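The core idea described in the walkthrough can be sketched with a toy model (plain Rust, NOT the actual arrow-rs API): hex-encode each distinct dictionary value once and keep the key indices unchanged, instead of expanding the dictionary to one value per row before encoding.

```rust
fn hex_encode(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Toy dictionary-encoded column: `keys` index into `values`;
/// a `None` key represents a NULL row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

impl DictColumn {
    /// Encode only the distinct values, mirroring the PR's
    /// `dict.with_values(encoded_values_array)` step.
    fn hex(&self) -> DictColumn {
        DictColumn {
            keys: self.keys.clone(),
            values: self.values.iter().map(|v| hex_encode(v.as_bytes())).collect(),
        }
    }

    /// Expand back to one value per row, for inspection only.
    fn materialize(&self) -> Vec<Option<String>> {
        self.keys
            .iter()
            .copied()
            .map(|k| k.map(|i| self.values[i].clone()))
            .collect()
    }
}

fn main() {
    // Four rows but only two distinct values: "foo" and "bar".
    let col = DictColumn {
        keys: vec![Some(0), Some(1), None, Some(0)],
        values: vec!["foo".to_string(), "bar".to_string()],
    };
    let hexed = col.hex(); // two encodings instead of four
    println!("{:?}", hexed.materialize());
}
```

With many repeated values, encoding `values.len()` entries instead of `keys.len()` rows is where the performance win comes from.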
Code Review: Optimize
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@datafusion/spark/src/function/math/hex.rs`:
- Around line 246-278: The DataType::Dictionary branch currently hardcodes
as_dictionary_array::<Int32Type>(&array) which will fail for dictionaries with
other integer key types; update the match to either (preferred) pattern-match
the dictionary key type and call as_dictionary_array with the matching integer
type (e.g., as_dictionary_array::<Int8Type>, ::<Int16Type>, ::<Int32Type>,
::<Int64Type>) and then proceed with encoding and dict.with_values, or
(alternative) enforce/support-only-Int32 by adding a validation in return_type
that rejects non-Int32 dictionary key types so this branch cannot be reached for
other key types; reference the DataType::Dictionary match arm and the
as_dictionary_array::<Int32Type> call when making the change.
🧹 Nitpick comments (1)
datafusion/spark/src/function/math/hex.rs (1)
250-268: Consider supporting additional dictionary value types for consistency.

The non-dictionary path supports Utf8View, LargeUtf8, LargeBinary, and FixedSizeBinary, but the dictionary path only handles Int64, Utf8, and Binary. Consider adding support for the missing types for consistency, or document why they're intentionally excluded.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
datafusion/spark/src/function/math/hex.rs
datafusion/sqllogictest/test_files/spark/math/hex.slt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (3)
datafusion/spark/src/function/math/hex.rs (2)
95-102: LGTM on return type logic.

The return type correctly propagates the dictionary key type while replacing the value type with Utf8 for hex-encoded output. This maintains dictionary encoding through the transformation.
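The return-type rule being praised here can be modeled with a toy enum (a sketch, NOT DataFusion's real `DataType`): preserve the dictionary key type and swap only the value type to Utf8.

```rust
// Toy model of the return-type rule: Dictionary(K, V) -> Dictionary(K, Utf8),
// everything else -> Utf8.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
    Int64,
    Dictionary(Box<DataType>, Box<DataType>), // (key type, value type)
}

fn hex_return_type(arg: &DataType) -> DataType {
    match arg {
        DataType::Dictionary(key_type, _) => {
            // Key type is preserved; value type becomes Utf8.
            DataType::Dictionary(key_type.clone(), Box::new(DataType::Utf8))
        }
        _ => DataType::Utf8,
    }
}

fn main() {
    let input = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Int64));
    // Dictionary(Int32, Int64) -> Dictionary(Int32, Utf8)
    println!("{:?}", hex_return_type(&input));
    // Plain Int64 -> Utf8
    println!("{:?}", hex_return_type(&DataType::Int64));
}
```

Preserving the key type in the declared return type is exactly what makes the hardcoded `Int32Type` downcast in the compute path (discussed below) a latent mismatch.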
302-328: Good test coverage for dictionary paths.

The tests appropriately cover:
- Dictionary with Utf8 values
- Dictionary with Int64 values
- Dictionary with Binary values
- Null handling in dictionary values

Note: All tests use Int32Type keys, which aligns with the current implementation limitation mentioned above.

Also applies to: 331-357, 360-386, 426-447
datafusion/sqllogictest/test_files/spark/math/hex.slt (1)
67-80: Good integration test for dictionary-encoded input.

The test effectively validates:
- Dictionary encoding with Int32 keys (matches the implementation)
- Duplicate values ('foo' and 'bar' appear twice) to verify dictionary structure is preserved
- NULL handling within the dictionary
- Correct hex encoding of string values
```diff
  DataType::Dictionary(_, _) => {
      let dict = as_dictionary_array::<Int32Type>(&array);
+     let dict_values = dict.values();

-     match **value_type {
+     let encoded_values: ColumnarValue = match dict_values.data_type() {
          DataType::Int64 => {
-             let arr = dict.downcast_dict::<Int64Array>().unwrap();
-             hex_encode_int64(arr.into_iter(), dict.len())
+             let arr = as_int64_array(dict_values)?;
+             hex_encode_int64(arr.iter(), arr.len())?
          }
          DataType::Utf8 => {
-             let arr = dict.downcast_dict::<StringArray>().unwrap();
-             hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+             let arr = as_string_array(dict_values);
+             hex_encode_bytes(arr.iter(), lowercase, arr.len())?
          }
          DataType::Binary => {
-             let arr = dict.downcast_dict::<BinaryArray>().unwrap();
-             hex_encode_bytes(arr.into_iter(), lowercase, dict.len())
+             let arr = as_binary_array(dict_values)?;
+             hex_encode_bytes(arr.iter(), lowercase, arr.len())?
          }
          _ => {
-             exec_err!(
+             return exec_err!(
                  "hex got an unexpected argument type: {}",
-                 array.data_type()
+                 dict_values.data_type()
              );
          }
-     }
+     };

+     let encoded_values_array: ArrayRef = match encoded_values {
+         ColumnarValue::Array(a) => a,
+         ColumnarValue::Scalar(s) => Arc::new(s.to_array()?),
+     };
+
+     let new_dict = dict.with_values(encoded_values_array);
+     Ok(ColumnarValue::Array(Arc::new(new_dict)))
  }
```
Dictionary key type is hardcoded to Int32Type, but return_type accepts any key type.
The return_type method at lines 97-99 accepts dictionaries with any key type and returns a dictionary preserving that key type. However, this compute path hardcodes Int32Type at line 247. If a dictionary with Int8Type, Int16Type, or Int64Type keys is passed, this will panic or produce incorrect results at runtime.
🔧 Suggested fix: Match on key type or add validation

Option 1: Handle multiple key types explicitly:

```diff
-DataType::Dictionary(_, _) => {
-    let dict = as_dictionary_array::<Int32Type>(&array);
+DataType::Dictionary(key_type, _) => {
+    macro_rules! handle_dict_key_type {
+        ($key_type:ty) => {{
+            let dict = as_dictionary_array::<$key_type>(&array);
+            // ... rest of handling
+        }};
+    }
+    match key_type.as_ref() {
+        DataType::Int8 => handle_dict_key_type!(Int8Type),
+        DataType::Int16 => handle_dict_key_type!(Int16Type),
+        DataType::Int32 => handle_dict_key_type!(Int32Type),
+        DataType::Int64 => handle_dict_key_type!(Int64Type),
+        // ... etc
+    }
```

Option 2: If only Int32Type is intended to be supported, add validation in return_type:
```diff
 fn return_type(&self, arg_types: &[DataType]) -> datafusion_common::Result<DataType> {
     Ok(match &arg_types[0] {
-        DataType::Dictionary(key_type, _) => {
+        DataType::Dictionary(key_type, _) if **key_type == DataType::Int32 => {
             DataType::Dictionary(key_type.clone(), Box::new(DataType::Utf8))
         }
         _ => DataType::Utf8,
     })
 }
```
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The proposed changes try to cast to Int32Array but the user defined function signature allows any argument that could be casted to Int64, i.e. Int8, Int16, Int32 and Int64, so the casting here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked casting.
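The unchecked-downcast failure mode flagged above can be illustrated with a stdlib-only sketch (this models the pattern, not arrow-rs internals): a downcast hardcoded to one concrete type fails at runtime when the actual type differs, just as `as_dictionary_array::<Int32Type>` would for a dictionary with non-Int32 keys.

```rust
use std::any::Any;

// Hardcoding one concrete type at the call site: works only when the
// runtime type matches, panics otherwise (via expect), mirroring
// as_dictionary_array::<Int32Type> on a non-Int32-keyed dictionary.
fn as_i32(x: &dyn Any) -> &i32 {
    x.downcast_ref::<i32>().expect("expected i32")
}

fn main() {
    let matching: i32 = 7;
    println!("{}", as_i32(&matching)); // downcast succeeds

    let mismatched: i64 = 7;
    // as_i32(&mismatched); // would panic: runtime type is i64
    assert!((&mismatched as &dyn Any).downcast_ref::<i32>().is_none());
    println!("i64 downcast to i32 returned None, as expected");
}
```

Matching on the key type before downcasting (as in Option 1 above) turns this latent panic into an exhaustively handled branch.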
🤖 Augment PR Summary

Summary: Improves performance of the Spark-compatible `spark_hex` function's dictionary path.

Changes:
```diff
     hex_encode_bytes(array.iter(), lowercase, array.len())
 }
-DataType::Dictionary(_, value_type) => {
+DataType::Dictionary(_, _) => {
```
This dictionary path downcasts via as_dictionary_array::<Int32Type> (and return_type now preserves the input key type), so a non-Int32 dictionary input could panic at runtime. Consider validating/rejecting non-Int32 dictionary keys or supporting dictionary keys generically here.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes try to cast to Int32Array but the user defined function signature allows any argument that could be casted to Int64, i.e. Int8, Int16, Int32 and Int64, so the casting here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked casting.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The proposed changes try to cast to Int32Array but the user defined function signature allows any argument that could be casted to Int64, i.e. Int8, Int16, Int32 and Int64, so the casting here may fail at runtime. Prevents an application crash due to the panic caused by the unchecked casting.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished new feature.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The Dictionary path should support the same value types as the non-Dictionary path, i.e. it should add support for LargeUtf8, LargeBinary, Utf8View and BinaryView. Prevents releasing an unfinished new feature.