
19065: refactor: Refactor spark width bucket signature away from user defined #97

Open
martin-augment wants to merge 1 commit into main from pr-19065-2025-12-03-13-54-08

Conversation

@martin-augment
Owner

19065: To review by AI

@coderabbitai

coderabbitai bot commented Dec 3, 2025

Walkthrough

This pull request extends the logical type system in DataFusion and refactors the width_bucket function implementation. Two new logical type singleton variants are introduced: LOGICAL_INTERVAL_YEAR_MONTH and LOGICAL_DURATION_MICROSECOND, along with corresponding getter functions to provide globally accessible type references. Separately, the width_bucket function's type coercion and error handling are restructured. A new helper constructs a composite type signature supporting numeric, duration, and interval types. The UDF invocation flow is updated to use take_function_args for argument extraction and casting, and error messages for unsupported data types are made more explicit. The public API surface remains unchanged.


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08eeeba and d9c0e4c.

📒 Files selected for processing (2)
  • datafusion/common/src/types/builtin.rs (2 hunks)
  • datafusion/spark/src/function/math/width_bucket.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/spark/src/function/math/width_bucket.rs
🧬 Code graph analysis (1)
datafusion/spark/src/function/math/width_bucket.rs (2)
datafusion/common/src/utils/mod.rs (1)
  • take_function_args (925-939)
datafusion/expr-common/src/signature.rs (4)
  • numeric (1124-1130)
  • new_implicit (946-958)
  • new_exact (938-940)
  • one_of (1206-1212)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (5)
datafusion/common/src/types/builtin.rs (1)

19-20: New logical singletons for Interval(YearMonth) and Duration(Microsecond) look consistent

The added TimeUnit::* import and the LOGICAL_INTERVAL_YEAR_MONTH / LOGICAL_DURATION_MICROSECOND singletons follow the existing singleton_variant! pattern and align with the Arrow units used by the width_bucket kernel. No issues from a type or API-usage perspective.

Also applies to: 80-99
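
For readers unfamiliar with the pattern being extended, the sketch below is a hand-written approximation of what such globally shared logical-type singletons and getters look like. It deliberately uses a plain LazyLock over Arrow DataType values instead of the crate's actual singleton macro and LogicalTypeRef machinery, so the names and types here are illustrative only.

use std::sync::{Arc, LazyLock};
use arrow::datatypes::{DataType, IntervalUnit, TimeUnit};

// Hypothetical stand-ins for the crate's logical-type references: lazily
// initialized, globally shared handles to a concrete Arrow type.
static LOGICAL_INTERVAL_YEAR_MONTH: LazyLock<Arc<DataType>> =
    LazyLock::new(|| Arc::new(DataType::Interval(IntervalUnit::YearMonth)));
static LOGICAL_DURATION_MICROSECOND: LazyLock<Arc<DataType>> =
    LazyLock::new(|| Arc::new(DataType::Duration(TimeUnit::Microsecond)));

// Getters mirroring the logical_interval_year_month() /
// logical_duration_microsecond() accessors described in the walkthrough.
pub fn logical_interval_year_month() -> Arc<DataType> {
    Arc::clone(&LOGICAL_INTERVAL_YEAR_MONTH)
}
pub fn logical_duration_microsecond() -> Arc<DataType> {
    Arc::clone(&LOGICAL_DURATION_MICROSECOND)
}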

datafusion/spark/src/function/math/width_bucket.rs (4)

32-37: Coercion-based signature construction is coherent with kernel types

The Coercion setup for numeric, duration(us), interval YM, and interval MDN, plus width_bucket_signature producing TypeSignature::Coercible([arg, arg, arg, bucket]), lines up with the four width_bucket_kern branches (Float64, Duration(Microsecond), Interval(YearMonth), Interval(MonthDayNano)) and the Int32 bucket. This should ensure inputs are coerced into exactly the types the kernel expects, while giving a single consistent parameter naming scheme.

Please double-check that constraining num_buckets to TypeSignatureClass::Integer matches Spark's expected types (i.e., that decimal or float bucket counts do not need to be accepted).

Also applies to: 39-42, 46-46, 62-71, 72-85, 86-105, 198-200
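
To make the shape of that signature concrete, here is a minimal sketch of how the pieces described above could be combined. It reuses the Coercion::new_implicit call and the width_bucket_signature helper quoted later in this thread, but the module paths, the bucket coercion, and logical_int32() are assumptions rather than the PR's literal code.

use datafusion_common::types::{logical_float64, logical_int32, NativeType};
use datafusion_expr_common::signature::{
    Coercion, Signature, TypeSignature, TypeSignatureClass, Volatility,
};

// Three same-typed arguments (value, min, max) plus a bucket-count argument.
fn width_bucket_signature(arg: &Coercion, bucket: &Coercion) -> TypeSignature {
    TypeSignature::Coercible(vec![arg.clone(), arg.clone(), arg.clone(), bucket.clone()])
}

fn build_signature() -> Signature {
    // Integers, floats, and decimals are implicitly widened to Float64.
    let numeric = Coercion::new_implicit(
        TypeSignatureClass::Native(logical_float64()),
        vec![
            TypeSignatureClass::Integer,
            TypeSignatureClass::Float,
            TypeSignatureClass::Decimal,
        ],
        NativeType::Float64,
    );
    // num_buckets: any integer, coerced to Int32 (assumed; see the question above).
    let bucket = Coercion::new_implicit(
        TypeSignatureClass::Native(logical_int32()),
        vec![TypeSignatureClass::Integer],
        NativeType::Int32,
    );
    // The real PR lists several variants (duration, interval YM, interval MDN)
    // via one_of; only the numeric variant is shown here.
    Signature::one_of(
        vec![width_bucket_signature(&numeric, &bucket)],
        Volatility::Immutable,
    )
}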


125-136: Argument extraction via take_function_args and to_array looks correct

Using take_function_args(self.name(), &args.args)? to obtain four &ColumnarValue values and then materializing them with to_array(args.number_rows)? cleanly enforces the arity and delegates casting to the planner/coercion logic. The resulting arrays vector is guaranteed to have length 4 before calling width_bucket_kern.
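
For illustration, a minimal sketch of that extraction-and-materialization pattern, assuming the usual ScalarFunctionArgs shape; width_bucket_kern is a placeholder mirroring the kernel named in this review, module paths are approximate, and error handling is simplified.

use arrow::array::ArrayRef;
use datafusion_common::utils::take_function_args;
use datafusion_common::Result;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs};

fn invoke_sketch(name: &str, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    // Enforce arity: returns a plan error unless exactly four arguments are present.
    let [value, min, max, num_buckets]: [&ColumnarValue; 4] =
        take_function_args(name, &args.args)?;

    // Materialize scalars/arrays into arrays of the batch's row count for the kernel.
    let arrays: Vec<ArrayRef> = [value, min, max, num_buckets]
        .iter()
        .map(|cv| cv.to_array(args.number_rows))
        .collect::<Result<_>>()?;

    // Dispatch to the typed kernel (hypothetical signature).
    let result = width_bucket_kern(&arrays)?;
    Ok(ColumnarValue::Array(result))
}

// Placeholder for the real kernel described in the review.
fn width_bucket_kern(_arrays: &[ArrayRef]) -> Result<ArrayRef> {
    unimplemented!("see datafusion/spark/src/function/math/width_bucket.rs")
}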


183-185: Wiring MDN branch through width_bucket_interval_mdn_exact is consistent

The Interval(MonthDayNano) match arm now directly calls width_bucket_interval_mdn_exact(v, min, max, n_bucket), which matches the function’s signature and the dedicated MDN tests below. This keeps the “exact” MDN handling centralized and aligns the dispatch with the other specialized kernels.


188-194: Internal error for unsupported runtime types is acceptable but is now “internal” by contract

The fall-through arm now returns internal_err!("width_bucket received unexpected data types: ..."), and the test only asserts on the common message fragment. Given the stronger type-signature/coercion story, this path should be unreachable from normal UDF execution and correctly signals a planner/engine bug if hit. Just be aware that direct kernel callers (like tests) will now see this as an internal error rather than a user-facing “unsupported types” execution error.

Also applies to: 783-785
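
To make the contract distinction concrete, here is a small sketch of how the two error kinds differ for a caller; the message strings are illustrative, not the PR's exact text.

use datafusion_common::{exec_err, internal_err, DataFusionError, Result};

fn user_facing(types: &str) -> Result<()> {
    // Execution errors represent "expected" runtime failures, e.g. genuinely
    // unsupported user input; this is roughly what the old path reported.
    exec_err!("width_bucket does not support data types: {types}")
}

fn post_coercion(types: &str) -> Result<()> {
    // Internal errors signal a bug in DataFusion itself; with the new coercible
    // signature this arm should be unreachable through normal planning.
    internal_err!("width_bucket received unexpected data types: {types}")
}

fn main() {
    assert!(matches!(user_facing("Utf8").unwrap_err(), DataFusionError::Execution(_)));
    assert!(matches!(post_coercion("Utf8").unwrap_err(), DataFusionError::Internal(_)));
}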



@claude

claude bot commented Dec 3, 2025

Code Review for PR #97: Refactor Spark width_bucket signature

Summary

This PR refactors the width_bucket function from a user-defined signature with a manual coerce_types implementation to the new Coercion-based type signature system. This is a good modernization that makes the code more consistent with other Spark functions.


✅ Positive Aspects

  1. Consistent with modern patterns: The refactoring aligns with the newer approach used in other functions like bit_shift.rs, making the codebase more maintainable.

  2. Cleaner error handling: Replacing unsupported_data_types_exec_err with internal_err! is appropriate since type validation now happens at the signature level.

  3. Good test coverage: The existing comprehensive test suite (27+ test cases) covering all data types, edge cases, nulls, and error conditions remains intact.

  4. Proper use of new builtin types: Adding logical_interval_year_month and logical_duration_microsecond to the builtin types follows the existing pattern correctly.

  5. Simplified invoke logic: Removing make_scalar_function wrapper in favor of direct array handling makes the execution path clearer.


🔍 Issues & Concerns

1. Missing import cleanup (Minor)

Location: datafusion/spark/src/function/math/width_bucket.rs:21

The old import of unsupported_data_types_exec_err from the error_utils module is no longer needed, and it has already been removed in this PR:

use crate::function::error_utils::unsupported_data_types_exec_err;

No further cleanup is required here. ✅

2. Import organization (Style - Minor)

Location: datafusion/spark/src/function/math/width_bucket.rs:44-46

The Arrow imports are split into two groups:

use arrow::array::{
    Array, ArrayRef, DurationMicrosecondArray, Float64Array, IntervalMonthDayNanoArray,
    IntervalYearMonthArray,
};
// ... other imports ...
use arrow::array::{Int32Array, Int32Builder};
use arrow::datatypes::TimeUnit::Microsecond;

Suggestion: Consolidate all arrow::array imports into a single block for consistency:

use arrow::array::{
    Array, ArrayRef, DurationMicrosecondArray, Float64Array, Int32Array, Int32Builder,
    IntervalMonthDayNanoArray, IntervalYearMonthArray,
};

3. Error message consistency (Minor)

Location: datafusion/spark/src/function/math/width_bucket.rs:188-195

The new error uses internal_err! which is appropriate since this should never happen after proper type coercion. However, the error message could be slightly more helpful:

other => internal_err!(
    "width_bucket received unexpected data types: {:?}, {:?}, {:?}, {:?}",
    other,
    minv.data_type(),
    maxv.data_type(),
    nb.data_type()
),

Suggestion: Consider adding context that this is a post-coercion error to help debugging:

other => internal_err!(
    "width_bucket: unexpected data type after coercion: {:?} (expected Float64, Duration(Microsecond), Interval(YearMonth), or Interval(MonthDayNano))",
    other
),

4. Potential type signature issue (Needs verification)

Location: datafusion/spark/src/function/math/width_bucket.rs:62-85

The numeric coercion accepts:

let numeric = Coercion::new_implicit(
    TypeSignatureClass::Native(logical_float64()),
    vec![
        TypeSignatureClass::Integer,
        TypeSignatureClass::Float,
        TypeSignatureClass::Decimal,
    ],
    NativeType::Float64,
);

The old implementation used is_signed_numeric which checked for signed types. The new implementation accepts all integers (including unsigned via TypeSignatureClass::Integer) and decimals.

Question: Is this intentional? The old coerce_types implementation was:

(a, b, c, &(Int8 | Int16 | Int32 | Int64))
    if is_signed_numeric(a) && is_signed_numeric(b) && is_signed_numeric(c)

If unsigned integers should not be accepted, you need to be more explicit with the coercion classes.

5. Documentation opportunity (Enhancement)

Consider adding a doc comment to the width_bucket_signature helper function explaining its purpose:

/// Creates a type signature for width_bucket with three arguments of the same type
/// (value, min, max) and a bucket count argument.
fn width_bucket_signature(arg: &Coercion, bucket: &Coercion) -> TypeSignature {
    TypeSignature::Coercible(vec![arg.clone(), arg.clone(), arg.clone(), bucket.clone()])
}

🔒 Security Considerations

No security concerns identified. The function maintains the same validation logic for:

  • Null handling
  • Zero or negative bucket counts
  • Equal min/max bounds
  • NaN and infinite value checks

⚡ Performance Considerations

  • Neutral: The refactoring doesn't introduce performance regressions. Type coercion happens at planning time, not execution time.
  • The removal of make_scalar_function wrapper is slightly more efficient by avoiding an extra function call layer.

🧪 Test Coverage

The existing test suite is comprehensive and covers:

  • All supported data types (Float64, Duration, Interval variants)
  • Ascending and descending ranges
  • Boundary conditions
  • Null propagation
  • Invalid inputs (wrong arg count, unsupported types, invalid bucket counts)
  • Edge cases (NaN, equal bounds, negative ranges)

Recommendation: The tests pass as-is, which confirms backward compatibility. ✅


📋 Recommendations Summary

Must Fix:

  • None

Should Fix:

  1. Verify that unsigned integer handling is intentional (see issue 4 above)
  2. Consolidate arrow::array imports for consistency

Nice to Have:

  1. Improve error message clarity
  2. Add documentation to helper function

Overall Assessment

Status: ✅ Approve with minor suggestions

This is a solid refactoring that modernizes the code to use the current type signature system. The logic remains functionally equivalent, and the comprehensive test suite validates the behavior. The main concern is verifying whether the change in numeric type handling (accepting unsigned integers) is intentional.

Code Quality: 8.5/10
Best Practices: 9/10
Test Coverage: 10/10
Security: 10/10


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.


@martin-augment
Owner Author

2. Import organization (Style - Minor)

Location: datafusion/spark/src/function/math/width_bucket.rs:44-46

The Arrow imports are split into two groups:

use arrow::array::{
    Array, ArrayRef, DurationMicrosecondArray, Float64Array, IntervalMonthDayNanoArray,
    IntervalYearMonthArray,
};
// ... other imports ...
use arrow::array::{Int32Array, Int32Builder};
use arrow::datatypes::TimeUnit::Microsecond;

Suggestion: Consolidate all arrow::array imports into a single block for consistency:

use arrow::array::{
    Array, ArrayRef, DurationMicrosecondArray, Float64Array, Int32Array, Int32Builder,
    IntervalMonthDayNanoArray, IntervalYearMonthArray,
};

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! The imports could be optimized and combined into one group. This will make it easier to locate them if/when a developer needs to confirm the module a given item comes from.

@martin-augment
Owner Author

3. Error message consistency (Minor)

Location: datafusion/spark/src/function/math/width_bucket.rs:188-195

The new error uses internal_err! which is appropriate since this should never happen after proper type coercion. However, the error message could be slightly more helpful:

other => internal_err!(
    "width_bucket received unexpected data types: {:?}, {:?}, {:?}, {:?}",
    other,
    minv.data_type(),
    maxv.data_type(),
    nb.data_type()
),

Suggestion: Consider adding context that this is a post-coercion error to help debugging:

other => internal_err!(
    "width_bucket: unexpected data type after coercion: {:?} (expected Float64, Duration(Microsecond), Interval(YearMonth), or Interval(MonthDayNano))",
    other
),

value:annoying; category:bug; feedback:The Claude AI reviewer is not correct! The wording proposed in the pull request conveys the same information but is future-proof: when a new supported type is added, the error output stays accurate automatically, without the developer having to update a hard-coded list of expected types.

@martin-augment
Owner Author

4. Potential type signature issue (Needs verification)

Location: datafusion/spark/src/function/math/width_bucket.rs:62-85

The numeric coercion accepts:

let numeric = Coercion::new_implicit(
    TypeSignatureClass::Native(logical_float64()),
    vec![
        TypeSignatureClass::Integer,
        TypeSignatureClass::Float,
        TypeSignatureClass::Decimal,
    ],
    NativeType::Float64,
);

The old implementation used is_signed_numeric which checked for signed types. The new implementation accepts all integers (including unsigned via TypeSignatureClass::Integer) and decimals.

Question: Is this intentional? The old coerce_types implementation was:

(a, b, c, &(Int8 | Int16 | Int32 | Int64))
    if is_signed_numeric(a) && is_signed_numeric(b) && is_signed_numeric(c)

If unsigned integers should not be accepted, you need to be more explicit with the coercion classes.

value:good-but-wont-fix; category:bug; feedback:The Claude AI reviewer is correct! The new signature is wider and also accepts unsigned integers (u8, u16, u32, etc.). Since all of them can be cast to f64, there is no real problem.
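
For what it's worth, a tiny standalone sketch of that cast, using Arrow's cast kernel directly (essentially the conversion a planner-inserted cast performs); every u32 is exactly representable as f64, so the widening is lossless:

use arrow::array::{Array, Float64Array, UInt32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // An unsigned integer column, as a user might now pass to width_bucket.
    let values = UInt32Array::from(vec![0u32, 5, 4_000_000_000]);

    // Widen to Float64, the type the numeric kernel branch expects.
    let as_f64 = cast(&values, &DataType::Float64)?;
    let as_f64 = as_f64.as_any().downcast_ref::<Float64Array>().unwrap();

    assert_eq!(as_f64.values().to_vec(), vec![0.0, 5.0, 4_000_000_000.0]);
    Ok(())
}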

