
Conversation


@AdamGS AdamGS commented Sep 27, 2025

Which issue does this PR close?

This is a follow-up to #17501 and #17489 (should I open a dedicated issue?).

The changes are taken out of #17759 as the SQL-facing changes ended up making more of an impact than I expected, and should probably be discussed in #17747.

Rationale for this change

Improve support for working with decimal columns of different widths.

What changes are included in this PR?

The biggest change in this PR is that type coercion for binary expressions now not only supports the new decimal types, but also tries to widen one side into the other, a case that currently just fails.

Minor changes

  1. Support abs(Decimal32/64)
  2. Support sort-merge join (SMJ) on Decimal32/64 (Decimal256 was left out, and I'm unclear whether it was overlooked or omitted intentionally).

Are these changes tested?

Additional coercion tests

Are there any user-facing changes?

No public API changes, but more combinations of operations should work.

@github-actions github-actions bot added logical-expr Logical plan and expressions common Related to common crate functions Changes to functions implementation physical-plan Changes to the physical-plan crate spark labels Sep 27, 2025
@xudong963 xudong963 (Member) left a comment

Thank you @AdamGS

Comment on lines +1999 to +2000
DataType::Decimal32(..) => compare_value!(Decimal32Array),
DataType::Decimal64(..) => compare_value!(Decimal64Array),
Contributor

Yeah I am curious why 256 is omitted? Perhaps we can add it in if there's no compiler error doing so?

Contributor Author

Sure, I'll add it. Through this work I've noticed a few places where it was omitted, but I've yet to find a reason or even a failing test that would explain it; usually it's just a matter of adding a branch to a match statement, like here.

Comment on lines 99 to 106
Int8 | Int16 | Int32 | Int64 | Float32 | Float64 | Decimal128(_, _)
Int8 | Int16
| Int32
| Int64
| Float32
| Float64
| Decimal32(_, _)
| Decimal64(_, _)
| Decimal128(_, _)
Contributor

Can use is_signed_numeric() here potentially (though that brings in Float16)

Contributor Author

That seems like it's probably fine; I also talked to my local Spark expert and it seems like it should be fine (plus I assume there are other places where we deal with the conversion to Spark's type system).

Comment on lines +1022 to +1024
let s = s1.max(s2);
let range = (p1 as i8 - s1).max(p2 as i8 - s2);
let required_precision = (range + s) as u8;
Contributor

What happens if we have:

Decimal256 with precision 76 (max) and scale 0, and Decimal128 with precision 38 (max) with scale 1;

So s = 1, range = 76, required_precision = 76 + 1 -> overflow?

Is this a valid case?
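To make the edge case concrete, here is a minimal standalone sketch (my own, not DataFusion's actual code) of the quoted arithmetic applied to this example:

```rust
// Standalone sketch of the precision calculation quoted above, applied to
// Decimal256(precision 76, scale 0) vs Decimal128(precision 38, scale 1).
fn required_precision(p1: u8, s1: i8, p2: u8, s2: i8) -> i16 {
    let s = s1.max(s2);
    let range = (p1 as i8 - s1).max(p2 as i8 - s2);
    // Widened to i16 here so the example can show the out-of-range value;
    // the original code casts to u8 instead.
    range as i16 + s as i16
}

fn main() {
    const DECIMAL256_MAX_PRECISION: i16 = 76;
    let p = required_precision(76, 0, 38, 1);
    // s = 1 and range = 76, so the required precision is 77, exceeding
    // the Decimal256 maximum of 76.
    assert_eq!(p, 77);
    assert!(p > DECIMAL256_MAX_PRECISION);
    println!("required precision = {p}");
}
```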

Contributor Author

I don't think an overflow is valid; I'll have to think about it and maybe look into solutions in other systems.
We can also just return None in that case, which should force the user to add an explicit cast to one side.

Contributor Author

I looked around a bit, and what I could find is:

  1. DataFusion already has multiple issues regarding cast overflow/precision loss (decimal calculate overflow but not throw error #16406, Datafusion downcasts decimal loosing precision  #13492), which I'm happy to take on but are unrelated here.
  2. Spark (which seems to be the main inspiration for this code) has a configuration to control how it handles these cases (here and here).

I'm not sure what the desired behavior regarding precision loss is (should it be configurable? Is there currently an accepted desired behavior?). I think for this PR it should be fine to just return None if the precision overflows, and to take the bigger conversation into an issue where people can weigh in; I'll be glad to take that forward. What do you think?

Contributor

I think returning None in cases like this for this PR is fine 👍

Contributor Author

Done as part of fd1f043

Contributor

Cheers; left another minor comment related to the check below. Also would be nice if we had a test for this edge case.

@AdamGS AdamGS force-pushed the adamg/fill-decimal-gaps branch from 62eb314 to b4c7400 on September 29, 2025 10:58
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
@AdamGS AdamGS force-pushed the adamg/fill-decimal-gaps branch from b4c7400 to fd1f043 on September 29, 2025 11:05
@AdamGS AdamGS mentioned this pull request Sep 29, 2025
Comment on lines 1037 to 1061
// We currently don't handle cases where the required precision overflows
if required_precision > DECIMAL256_MAX_PRECISION {
return None;
}

// Choose the larger variant between the two input types
match (lhs_type, rhs_type) {
(Decimal32(_, _), Decimal64(_, _)) | (Decimal64(_, _), Decimal32(_, _)) => {
Some(Decimal64(required_precision, s))
}
(Decimal32(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal32(_, _)) => {
Some(Decimal128(required_precision, s))
}
(Decimal32(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal32(_, _)) => {
Some(Decimal256(required_precision, s))
}
(Decimal64(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal64(_, _)) => {
Some(Decimal128(required_precision, s))
}
(Decimal64(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal64(_, _)) => {
Some(Decimal256(required_precision, s))
}
(Decimal128(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal128(_, _)) => {
Some(Decimal256(required_precision, s))
}
Contributor

Suggested change

// Choose the larger variant between the two input types, guarding each
// branch against overflowing that variant's maximum precision
match (lhs_type, rhs_type) {
    (Decimal32(_, _), Decimal64(_, _)) | (Decimal64(_, _), Decimal32(_, _))
        if required_precision <= DECIMAL64_MAX_PRECISION =>
    {
        Some(Decimal64(required_precision, s))
    }
    (Decimal32(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal32(_, _))
        if required_precision <= DECIMAL128_MAX_PRECISION =>
    {
        Some(Decimal128(required_precision, s))
    }
    (Decimal32(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal32(_, _))
        if required_precision <= DECIMAL256_MAX_PRECISION =>
    {
        Some(Decimal256(required_precision, s))
    }
    (Decimal64(_, _), Decimal128(_, _)) | (Decimal128(_, _), Decimal64(_, _))
        if required_precision <= DECIMAL128_MAX_PRECISION =>
    {
        Some(Decimal128(required_precision, s))
    }
    (Decimal64(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal64(_, _))
        if required_precision <= DECIMAL256_MAX_PRECISION =>
    {
        Some(Decimal256(required_precision, s))
    }
    (Decimal128(_, _), Decimal256(_, _)) | (Decimal256(_, _), Decimal128(_, _))
        if required_precision <= DECIMAL256_MAX_PRECISION =>
    {
        Some(Decimal256(required_precision, s))
    }

Do we need to account for the other decimal types like so?
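For illustration, the variant selection these guards imply can be sketched independently of DataFusion. `smallest_fitting_variant` and the string labels below are hypothetical stand-ins of mine; the only facts assumed are Arrow's maximum precisions per variant (9/18/38/76):

```rust
// Hypothetical sketch, not the PR's actual code: choose the narrowest decimal
// variant that can represent the required precision, mirroring the guarded
// match arms suggested above.
const DECIMAL32_MAX_PRECISION: u8 = 9;
const DECIMAL64_MAX_PRECISION: u8 = 18;
const DECIMAL128_MAX_PRECISION: u8 = 38;
const DECIMAL256_MAX_PRECISION: u8 = 76;

fn smallest_fitting_variant(required_precision: u8) -> Option<&'static str> {
    match required_precision {
        0..=DECIMAL32_MAX_PRECISION => Some("Decimal32"),
        10..=DECIMAL64_MAX_PRECISION => Some("Decimal64"),
        19..=DECIMAL128_MAX_PRECISION => Some("Decimal128"),
        39..=DECIMAL256_MAX_PRECISION => Some("Decimal256"),
        // Beyond 76 digits nothing fits; coercion should return None.
        _ => None,
    }
}

fn main() {
    assert_eq!(smallest_fitting_variant(18), Some("Decimal64"));
    assert_eq!(smallest_fitting_variant(39), Some("Decimal256"));
    assert_eq!(smallest_fitting_variant(77), None);
    println!("variant selection behaves as expected");
}
```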

Contributor Author

I'll add some tests to pin down what the correct behavior here is.

Contributor Author

done as part of 4145a04

pub fn decimal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;

match (lhs_type, rhs_type) {
Contributor

This might be cleaner like so:

/// Decimal coercion rules.
pub fn decimal_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;

    // Prefer decimal data type over floating point for comparison operation
    match (lhs_type, rhs_type) {
        // Same decimal types
        (lhs_type, rhs_type)
            if std::mem::discriminant(lhs_type) == std::mem::discriminant(rhs_type) =>
        {
            get_wider_decimal_type(lhs_type, rhs_type)
        }
        // Mismatched decimal types
        (lhs_type, rhs_type)
            if is_decimal(lhs_type)
                && is_decimal(rhs_type)
                && std::mem::discriminant(lhs_type)
                    != std::mem::discriminant(rhs_type) =>
        {
            get_wider_decimal_type_cross_variant(lhs_type, rhs_type)
        }
        // Decimal + non-decimal types
        (Decimal32(_, _) | Decimal64(_, _) | Decimal128(_, _) | Decimal256(_, _), _)
        | (_, Decimal32(_, _) | Decimal64(_, _) | Decimal128(_, _) | Decimal256(_, _)) => {
            get_common_decimal_type(lhs_type, rhs_type)
        }
        (_, _) => None,
    }
}

Following what was done above
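As a side note on the `std::mem::discriminant` trick used here, a tiny self-contained sketch (with a stand-in enum of mine, not arrow's real `DataType`) shows what it does and does not compare:

```rust
use std::mem::discriminant;

// Minimal stand-in enum to illustrate the discriminant comparison used in the
// suggestion above; the real arrow DataType has many more variants.
#[derive(Debug)]
enum DataType {
    Decimal32(u8, i8),
    Decimal64(u8, i8),
}

fn main() {
    let a = DataType::Decimal32(9, 2);
    let b = DataType::Decimal32(7, 1);
    let c = DataType::Decimal64(18, 4);
    // Same variant compares equal regardless of precision/scale values...
    assert_eq!(discriminant(&a), discriminant(&b));
    // ...while different variants compare unequal.
    assert_ne!(discriminant(&a), discriminant(&c));
    println!("discriminant distinguishes variants, not field values");
}
```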

Contributor

Oops forgot the is_decimal() checks for the first branch

Contributor Author

yeah I've added them locally :) should have something soon

Contributor Author

done as part of 4145a04


AdamGS commented Sep 29, 2025

Added tests to check for precision overflow. This really brings things close to making decimal a logical type, which would be nice but would also make it behave very differently from any other type.

@Jefffrey Jefffrey left a comment


LGTM 👍

@Jefffrey Jefffrey added this pull request to the merge queue Sep 30, 2025
Merged via the queue into apache:main with commit 032117a Sep 30, 2025
28 checks passed
AdamGS added a commit to AdamGS/arrow-datafusion that referenced this pull request Oct 3, 2025
* More small decimal support

* CR comments

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

* Add tests and cleanup some code

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
AdamGS added a commit to AdamGS/arrow-datafusion that referenced this pull request Oct 3, 2025