[Variant] Support variant to Decimal32/64/128/256 #8552

Conversation
Looks good to me -- thank you @liamzwbao 🙏
```rust
let result = variant_get(&variant_array, options).unwrap();
let result = result.as_any().downcast_ref::<Decimal32Array>().unwrap();

assert_eq!(result.value(0), 124);
```
to match the test above, it would probably be good to assert `result.precision()` and `result.scale()` as well
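For example, something like this (a sketch extending the quoted test; the expected precision/scale values are illustrative, not taken from the actual test):

```rust
// Illustrative assertions; substitute the precision/scale the test requested.
assert_eq!(result.precision(), 9);
assert_eq!(result.scale(), 2);
```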
Likewise for the Decimal 64/128/256 cases too
Makes sense, added
Made an initial pass. The high-level comments are more important than the low-level syntax nits.
```rust
// scale_down means output has fewer fractional digits than input
// divide by 10^(input_scale - output_scale) with rounding
```
I'm a bit nervous about rounding (the whole point of decimal is to be lossless, unlike floating point). But I guess in this case the user specifically asked for the narrower type, so the usual worries about lossy coercion don't apply?
I think rounding makes sense here as arrow variant conversion could also cause precision loss due to rescaling. But we could also introduce a new option to fail on precision loss if needed
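Such an option might look like this (a hypothetical sketch; `fail_on_precision_loss` is not a field of any existing options struct):

```rust
// Hypothetical knob, named here only for illustration.
pub struct VariantCastOptions {
    /// If true, return an error when rescaling would drop nonzero digits,
    /// instead of rounding them away.
    pub fail_on_precision_loss: bool,
}
```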
```rust
let d = v.checked_div(div)?;
let r = v % div;

// rounding in the same way as convert_to_smaller_scale_decimal in arrow-cast
```
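Filled out, that fragment might look like the following (a sketch using half-away-from-zero rounding; the actual arrow-cast helper may differ in details):

```rust
/// Divide an unscaled decimal value by `div = 10^(input_scale - output_scale)`,
/// rounding half away from zero (sketch, not the PR's exact code).
fn scale_down(v: i64, div: i64) -> Option<i64> {
    let d = v.checked_div(div)?;
    let r = v % div;
    // round away from zero when the remainder is at least half the divisor
    let adj = if r.unsigned_abs() * 2 >= div.unsigned_abs() { v.signum() } else { 0 };
    d.checked_add(adj)
}

fn main() {
    // 1.2345 at scale 4 -> scale 2: 12345 / 100 = 123 r 45; 45 < 50, so 1.23
    assert_eq!(scale_down(12345, 100), Some(123));
    // 1.235 at scale 3 -> scale 2: 1235 / 10 = 123 r 5; 5 >= 5, rounds to 1.24
    assert_eq!(scale_down(1235, 10), Some(124));
}
```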
Looking at convert_to_smaller_scale_decimal, virtually all the logic in that function is doing exactly what we want here... and then the last line just applies the calculation as an appropriate unary operation on the input array. Rather than duplicate the logic, is there some way we could factor it out or otherwise reuse it? Problem is, it's in a different crate, so the factored out logic would have to be pub...
Those are internal helper functions; we could refactor the logic, but I'm not sure it's good to expose that. WDYT @alamb?
```rust
DataType::Decimal32(precision, scale) => Decimal32(
    VariantToDecimalArrowRowBuilder::new(cast_options, capacity, *precision, *scale)?,
```
To make sure I'm understanding correctly --
- Here, the user has requested e.g. Decimal32, so we create a decimal32 row builder
- The row builder invokes the `VariantDecimalScaler` trait, which eventually calls `Variant::as_decimal4`
- If the actual variant value was a wider decimal type, the conversion will produce `None` unless the unscaled value fits in the narrower type and the scale is small enough to fit as well (without rounding)?

But in this case, the user specifically requested rounding, so it seems odd to fail some of the time and not fail other times? In particular, going from Decimal32(9, 4) to Decimal32(9, 2) would succeed with rounding, but going from Decimal64(18, 4) to Decimal32(9, 2) would fail for a value like 1234567.8901, even though the rescaled result 1234567.89 is a valid Decimal32(9, 2)?

In order to correctly handle all valid narrowing conversions, we need to rescale+round first, using the original variant type, and then try to narrow the result to the requested type.
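A minimal plain-integer illustration of why the order matters (illustrative numbers, not the PR's code):

```rust
fn main() {
    // Narrowing Decimal64(18, 4) value 1234567.8901 to Decimal32(9, 2):
    let v: i64 = 12_345_678_901; // unscaled value at scale 4

    // Wrong order: narrowing to i32 first fails -- 12345678901 overflows i32.
    assert!(i32::try_from(v).is_err());

    // Right order: rescale+round in i64 first (divide by 10^2, rounding half up),
    // then narrow the 9-digit result, which fits Decimal32(9, 2).
    let rescaled = (v + 50) / 100;
    assert_eq!(i32::try_from(rescaled), Ok(123_456_789));
}
```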
The converse hazard exists for widening, where we need to widen first, and then rescale+round:
- Converting the `Decimal32(9, 9)` value `0.999999999` to `Decimal64(*, 0)` produces an intermediate value with ten decimal digits.
- Converting the `Decimal32(9, 0)` value `999999999` to `Decimal64(18, 9)` produces an intermediate (and final) value with 18 digits.
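The second bullet, in plain integers (illustrative only):

```rust
fn main() {
    // Widening Decimal32(9, 0) value 999999999 to Decimal64(18, 9)
    // multiplies by 10^9 -- which must happen *after* widening to i64.
    let v: i32 = 999_999_999;

    // Rescaling in i32 first would overflow:
    assert!(v.checked_mul(1_000_000_000).is_none());

    // Widening first, then rescaling, succeeds (18 digits fit in i64):
    let wide = v as i64;
    assert_eq!(wide.checked_mul(1_000_000_000), Some(999_999_999_000_000_000));
}
```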
Looking at all possible combinations:
- We are converting unscaled value `v1` of type `Variant::DecimalXX(p1, s1)` to `datatypes::DecimalYY(p2, s2)`
- The variant decimals have implied precision, so p1 is always one of {9, 19, 38} based on decimal type
- Let `n1 = p1 - s1` and `n2 = p2 - s2` (the max number of integer digits before and after conversion)
- If `n2 < n1`, there is an inherent risk of overflow regardless of what scales are involved
  - Before even looking at scale and scale changes, we should first verify that `v1` fits in `n2 + s1` digits. If not, flag overflow immediately. Otherwise, set `n1 = n2` and proceed to the next case.
  - NOTE: This check does NOT require changing the type of `v1`, because total precision decreased.
- Else if `n2 == n1` and `s2 < s1`, there is an overflow risk when e.g. 0.999 rounds to 1.00
  - Rescale, and then verify that the rounded result fits in `p2` digits.
  - NOTE: This check does NOT require changing the type of `v1`, because total precision decreased.
- Else, there is no risk of overflow
  - Convert `v1` to the new native type first
  - Then rescale and round as needed
  - NOTE: Both operations are infallible

That would correspond to something like the following code:
```rust
fn variant_to_unscaled_decimal32(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i32> {
    match variant {
        Variant::Decimal4(d) => {
            let s1 = d.scale();
            let mut n1 = VariantDecimal4::MAX_PRECISION - s1;
            let n2 = precision - scale;
            let v1 = d.integer();
            if n2 < n1 {
                // integer digits pose an overflow risk, and n2+s1 could even be out of precision range
                let max_value = MAX_DECIMAL32_FOR_EACH_PRECISION.get(n2 + s1);
                if max_value.is_none_or(|n| v1.unsigned_abs() > n) {
                    return Err(... overflow ...);
                }
                // else the value fits in n2 digits and we can pretend n1=n2
                n1 = n2;
            }
            if n2 == n1 {
                let v2 = ... rescale v1 and round up ...;
                if v2.unsigned_abs() > MAX_DECIMAL32_FOR_EACH_PRECISION[precision] {
                    return Err(... overflow ...);
                }
                // else the value can safely convert to the target type
                return Ok(v2 as _);
            }
            // no overflow possible, but still have to rescale and round
            let v1 = v1 as _;
            let v2 = ... rescale v1 and round up ...;
            Ok(v2)
        }
        Variant::Decimal8(d) => {
            ... almost the same code as for Decimal4 case ...
            ... except we use VariantDecimal8::MAX_PRECISION ...
            ... and we index into MAX_DECIMAL64_FOR_EACH_PRECISION ...
        }
        Variant::Decimal16(d) => {
            ... almost the same code as for Decimal4 case ...
            ... except we use VariantDecimal16::MAX_PRECISION ...
            ... and we index into MAX_DECIMAL128_FOR_EACH_PRECISION ...
        }
        Variant::Int8(i) => { ... treat it like `Variant::Decimal4(i, 0)` ... }
        Variant::Int16(i) => { ... treat it like `Variant::Decimal4(i, 0)` ... }
        Variant::Int32(i) => { ... treat it like `Variant::Decimal8(i, 0)` ... }
        Variant::Int64(i) => { ... treat it like `Variant::Decimal16(i, 0)` ... }
        _ => return Err(... not exact numeric data ...),
    }
}

fn variant_to_unscaled_decimal64(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i64> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i64 ...
}

fn variant_to_unscaled_decimal128(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i128> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i128 ...
}

fn variant_to_unscaled_decimal256(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i256> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i256 ...
}
```

So, we'd want two macros:
- Outer macro that produces the body of `variant_to_unscaled_decimalXX` functions
- Inner macro that produces the body of `Variant::DecimalXX` match arms
We need macros because integer types lack any helpful trait hierarchy that generics could take advantage of.
Update: Corrected a potential array out of bounds index in the n2 < n1 case.
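As a self-contained toy of the macro pattern (illustrative names, not the PR's code): one macro stamps out the same body per integer width, since `i32`/`i64`/`i128` share no std trait that exposes `checked_pow`/`checked_mul` generically.

```rust
macro_rules! make_upscaler {
    ($name:ident, $native:ty) => {
        /// Multiply an unscaled value by 10^delta_scale, detecting overflow.
        fn $name(v: $native, delta_scale: u32) -> Option<$native> {
            let mul = (10 as $native).checked_pow(delta_scale)?;
            v.checked_mul(mul)
        }
    };
}

make_upscaler!(upscale_i32, i32);
make_upscaler!(upscale_i64, i64);
make_upscaler!(upscale_i128, i128);

fn main() {
    assert_eq!(upscale_i32(123, 2), Some(12_300));
    assert_eq!(upscale_i64(i64::MAX, 1), None); // overflow detected
}
```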
Oh! Watch out, arrow decimals can have negative scale. My analysis above didn't necessarily account for that; I'm not sure if the original code in this PR does?
In particular, negative scale allows infallible conversions such as VariantDecimal16(38, 0) to Decimal4(9, -30) with rounding, and the n1 vs. n2 checks I proposed above might not accurately capture this nuance.
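A plain-integer sketch of why that conversion can be infallible (truncating instead of rounding, for brevity):

```rust
fn main() {
    // A 38-digit VariantDecimal16 value at scale 0, rescaled to scale -30,
    // drops 30 digits and leaves at most 8 integer digits -- always within
    // Decimal32's 9-digit precision.
    let v: i128 = 12_345_678_901_234_567_890_123_456_789_012_345_678; // 38 digits
    let rescaled = v / 10_i128.pow(30); // value now denotes rescaled * 10^30
    assert_eq!(rescaled, 12_345_678);
}
```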
Thanks for catching this! Let me dig a bit deeper and improve this conversion. Indeed it's possible to get a null for a valid decimal.
For negative scale, I think it's covered in this method, and I will add more tests for it. Also, the validate function in the `scale_variant_decimal` macro will check and make sure the value fits into a decimal with the specified precision.
I reviewed the implementation of `cast_decimal_to_decimal<I, O>` in arrow-cast, and it seems to already handle our cases quite well. Specifically:
- It checks `is_infallible_cast`, which covers case 3.
- For scale-up (s1 <= s2), it first converts `I::Native` to `O::Native` and then rescales. For scale-down (s1 > s2), it divides and rounds the result (`I::Native`) before converting to `O::Native`. This approach gracefully handles native-type overflow. The subsequent `DecimalType::is_valid_decimal_precision` call ensures precision validation, similar to our current `MAX_DECIMAL32_FOR_EACH_PRECISION.get(n2 + s1)` check, which effectively covers cases 1 & 2, where `n2 < n1` or `n2 == n1`.
- That said, case 1 (`n2 < n1`) might present an optimization opportunity, since we could skip rescaling. Functionally, though, the results should be the same. This could be explored in a follow-up PR.
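For intuition on the infallible check (the condition matches the sketch further below; a plain-integer illustration):

```rust
fn main() {
    // Scale-up case: Decimal32(5, 2) -> Decimal64(10, 4).
    // Every 5-digit input times 10^2 has at most 7 digits, so no per-value
    // check is needed when input_precision + delta_scale <= output_precision.
    let (input_precision, input_scale) = (5_i8, 2_i8);
    let (output_precision, output_scale) = (10_i8, 4_i8);
    let delta_scale = output_scale - input_scale;
    assert!(input_precision + delta_scale <= output_precision); // infallible
}
```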
Given this overlap, instead of duplicating logic, I plan to refactor the decimal cast function by extracting the shared core logic into a helper like the one below and exposing it, though that means we'd need a dependency on arrow-cast:
```rust
fn rescale_decimal<I, O>(
    integer: I::Native,
    input_precision: u8,
    input_scale: i8,
    output_precision: u8,
    output_scale: i8,
) -> Option<O::Native>
where
    I: DecimalType,
    O: DecimalType,
    I::Native: DecimalCast,
    O::Native: DecimalCast,
```

Then, in our case, we can simply wire the type conversions through this helper:
```rust
fn variant_to_unscaled_decimal32(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i32> {
    match variant {
        Variant::Decimal4(d) => rescale_decimal::<Decimal32, Decimal32>(
            d.integer(), VariantDecimal4::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Decimal8(d) => rescale_decimal::<Decimal64, Decimal32>(
            d.integer(), VariantDecimal8::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Decimal16(d) => rescale_decimal::<Decimal128, Decimal32>(
            d.integer(), VariantDecimal16::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Int8(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int16(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int32(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int64(i) => rescale_decimal::<Decimal64, Decimal32>(
            i, VariantDecimal8::MAX_PRECISION, 0, precision, scale),
        _ => return Err(... not exact numeric data ...),
    }
}
```

Let me know if you see any potential risks or edge cases I might have overlooked.
Thanks for the thorough review, @scovich! Addressed most of the comments, will improve the type cast then
I think this may be related to #8562 🤔
Hi @scovich, yeah, the core functionality we need is just  Once #8580 is merged, I will apply the same fix here. The downside is that if we find a similar bug in the future, we'll need to fix it in both places. But I think the refactor of 
Hi @liamzwbao -- this PR has some conflicts. Can you please resolve them so I can merge the PR? Sorry for the delay. We are juggling many things.
Thanks @liamzwbao and @scovich
A few posthumous nits to consider
```rust
let mul = O::Native::from_decimal(10_i128)
    .and_then(|t| t.pow_checked(delta_scale as u32).ok())?;
```
Could use the same performance optimization as the negative scale case below:
```diff
-let mul = O::Native::from_decimal(10_i128)
-    .and_then(|t| t.pow_checked(delta_scale as u32).ok())?;
+let max = O::MAX_FOR_EACH_PRECISION.get(delta_scale)?;
+let mul = max.add_wrapping(O::Native::ONE);
```
(it didn't matter much in the columnar decimal cast code, but it probably does matter in row-wise variant cast code)
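The lookup works because `MAX_FOR_EACH_PRECISION[k]` holds 10^k - 1, so adding one yields 10^k without any pow loop. A plain-integer illustration (toy table, not arrow's actual constants):

```rust
const MAX_FOR_EACH_PRECISION: [i64; 5] = [0, 9, 99, 999, 9_999];

fn pow10(delta_scale: usize) -> Option<i64> {
    // max + 1 == 10^delta_scale; `get` also bounds-checks the scale change
    Some(MAX_FOR_EACH_PRECISION.get(delta_scale)?.wrapping_add(1))
}

fn main() {
    assert_eq!(pow10(3), Some(1_000));
    assert_eq!(pow10(7), None); // delta too large for the table
}
```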
Also -- we should benchmark, but it might be faster to multiply by one than to execute the branch that distinguishes between zero and positive delta scale. If so, we would want code like this:
```rust
let (scaled, is_infallible_cast) = if delta_scale < 0 {
    // ... big comment about why ...
    let is_infallible_cast = input_precision + delta_scale < output_precision;
    // ... comment about dividing out too many digits ...
    let delta_scale = delta_scale.unsigned_abs() as usize;
    let Some(max) = ... else { ... return zero ... };
    ...
    (O::Native::from_decimal(adjusted)?, is_infallible_cast)
} else {
    // ... big comment explaining why ...
    let is_infallible_cast = input_precision + delta_scale <= output_precision;
    let max = O::MAX_FOR_EACH_PRECISION.get(delta_scale)?;
    let mul = max.add_wrapping(O::Native::ONE);
    let x = O::Native::from_decimal(value)?;
    (x.mul_checked(mul).ok()?, is_infallible_cast)
};

(is_infallible_cast || O::is_valid_decimal_precision(scaled, output_precision)).then_some(scaled)
```

```rust
pub(crate) fn rescale_decimal<I, O>(
    value: I::Native,
    input_precision: u8,
    input_scale: i8,
    output_precision: u8,
    output_scale: i8,
) -> Option<O::Native>
where
    I: DecimalType,
    O: DecimalType,
```
tiny nit to consider (saves space)
```diff
-pub(crate) fn rescale_decimal<I, O>(
+pub(crate) fn rescale_decimal<I: DecimalType, O: DecimalType>(
     value: I::Native,
     input_precision: u8,
     input_scale: i8,
     output_precision: u8,
     output_scale: i8,
 ) -> Option<O::Native>
 where
-    I: DecimalType,
-    O: DecimalType,
```
```rust
// Determine if the cast is infallible based on precision/scale math
let is_infallible_cast =
    is_infallible_decimal_cast(input_precision, input_scale, output_precision, output_scale);
```
nit: move this whole block down to where we actually use it -- declare near first (only) use
```rust
    output_precision: u8,
    output_scale: i8,
) -> bool {
    let delta_scale = output_scale - input_scale;
```
nit: we could have passed this in; our caller already computed it. But I guess this is more regular?
# Which issue does this PR close?
- Followup of #8552.

# Rationale for this change
Code cleanup and optimization

# What changes are included in this PR?
Addressed the post-merge comments in #8552 and refactored/optimized the method `rescale_decimal`

# Are these changes tested?
Covered by existing tests

# Are there any user-facing changes?
No
…n Parquet variant casts (#8689)

# Which issue does this PR close?
- Closes #8670.

# Rationale for this change
We currently have two separate code paths that both handle decimal casting between different (precision, scale) pairs. Without unifying the logic, a fix in one place often needs to be duplicated in the other (e.g., #8579 fixed `arrow-cast` and #8552 fixed `parquet-variant-compute`), which can easily lead to divergence when contributors lack full context. This PR consolidates the decimal rescale logic for both `arrow-cast` and `parquet-variant-compute`.

# What changes are included in this PR?
1. Extract the shared array-unary logic from `convert_to_smaller_scale_decimal` and `convert_to_bigger_or_equal_scale_decimal` into `apply_decimal_cast`
2. Move the rescale-closure creation into `make_upscaler` and `make_downscaler` so that they can be used in `parquet-variant-compute`
3. Rework `rescale_decimal` in `parquet-variant-compute` to use the new `make_upscaler` and `make_downscaler` utilities. One challenge is incorporating the large-scale reduction path (where the `delta_scale` cannot fit into `I::MAX_PRECISION`) into `make_downscaler` without hurting performance. Returning 0 directly is usually cheaper than applying a unary operation to return zero. Therefore, `make_downscaler` may return `None`, and it is the caller's responsibility to handle this case appropriately based on the documented behavior.

# Are these changes tested?
Covered by existing tests

# Are there any user-facing changes?
No
# Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

# Rationale for this change

# What changes are included in this PR?
Variant → Decimal32/64/128/256

# Are these changes tested?
Yes

# Are there any user-facing changes?
New cast types supported