Decimal Precision Validation #2387
Comments
cc @HaoYang670 @viirya @liukun4515 This is an excellent question -- "why do we need to validate Decimal Precision at all" -- it will likely drive design decisions such as what @HaoYang670 raised on #2362
As far as I know, the precision is actually the maximum number of decimal digits a value may have.
Because the Decimal type is somewhat a mixture of a static type (128 bits or 256 bits) and a dynamic type (precision and scale), we have to do validation at runtime to check for overflow.
I guess the behavior is undefined. Because …
Given we don't care about this for other types, because it has serious performance implications, why would we care about it for decimals? I guess my question can be phrased as: given that decimal overflow cannot lead to undefined behaviour, why do we need to check precision? A somewhat related point is that in Rust, signed integer overflow is not undefined behaviour and is explicitly defined to wrap as two's complement (in release builds), because checking at runtime is prohibitively expensive. I don't see an obvious reason we should treat decimals any differently; the overflow behaviour is perfectly well defined...
Though they are definitely not ideal 😆 But if someone cares we can add the Rust-style `checked_add()` etc. kernels and by default leave it unchecked 🤔
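For reference, a minimal sketch of the distinction being discussed, using plain i128 values rather than any arrow kernel: wrapping overflow is well defined, and the checked variants are the opt-in way to detect it:

```rust
fn main() {
    // Overflow with the wrapping methods is well defined: two's complement wrap-around.
    assert_eq!(i128::MAX.wrapping_add(1), i128::MIN);

    // The opt-in "checked" style reports the overflow to the caller instead.
    assert_eq!(i128::MAX.checked_add(1), None);
    assert_eq!(1_i128.checked_add(2), Some(3));
}
```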
I think that the precision for decimal specifies the representation range, like other types do. It's useful so we can know at which moment an overflow can happen and what range of values can be used with the type without overflow. As we don't have a native decimal type in Rust, it is internally represented by bytes with a maximum length for the maximum representation range. It could easily be given a value over its precision. For example, we can create a 128-bit decimal with precision 5 by giving it a value which is representable only by precision 10 (i.e., 10 digits). For native types you might get an overflow value, but in this case the 10-digit value still fits in the 128-bit bytes, so we don't have it as an overflow value there, I think.

I think that is why things become complicated. We may need to validate decimal values at some points. For now we throw some errors for invalid (overflow) values. Instead, I think it might also make sense to change the invalid values to an overflow representation to simulate overflow. Although I think for some systems, overflow will throw an error anyway. But as mentioned:
I guess that decimal overflow should be no different. I think that interoperability is a concern, because I'm not sure how other systems will interpret the decimal values from this crate. Assume we don't do any validation on decimals: if another system just takes the values without validation, I think the behavior might be unexpected. Would it be an overflow in that system, or would it truncate the value?
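A minimal sketch of the situation described above, using plain i128 arithmetic rather than any arrow type: a value that needs 10 digits still fits comfortably in the 128-bit storage of a precision-5 decimal, so only an explicit comparison against the precision bound can detect that it is out of range:

```rust
fn main() {
    let precision: u32 = 5;
    // Largest magnitude representable with 5 decimal digits.
    let max_for_precision = 10_i128.pow(precision) - 1; // 99_999

    // A 10-digit value, representable only with precision 10.
    let value: i128 = 1_234_567_890;

    // The 128-bit storage holds it without any integer-level overflow...
    assert!(value.abs() < i128::MAX);

    // ...so detecting the precision "overflow" requires an explicit range check.
    assert!(value.abs() > max_for_precision);
}
```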
Unexpected, perhaps, but not undefined. I'm not saying we don't provide methods to validate decimals, just that we make this explicitly opt-in?
I don't have a strong opinion on this question, for what it is worth. Ensuring that array data has reasonable values for decimal seems ok to me (to fail fast rather than fail at some much later point), but if there is some compelling reason not to validate I could be convinced that way too.
I'm really struggling to review the decimal array work without an answer to this, as the current approach is wildly inconsistent to the point of being effectively meaningless... As it isn't consistently enforced, there is no reliable way to elide validation as you cannot assume validation has actually been performed
The result is that it is perfectly possible to construct a DecimalArray with overflowed values using safe APIs. Consequently it is unclear how you can then meaningfully elide validation, or reason about whether an operation will overflow, as you can't guarantee the input data wasn't already overflowed... It is also unclear what the meaning of validation is... For example, if you use …

I can see 3 possible paths forward:

1. Strict Validation: This is the most "correct" interpretation, but it has significant performance drawbacks, and is much stronger than the guarantees we provide for other arithmetic types.
2. Loose Validation: This would allow opt-in overflow detection, but would require the user to perform validation after every arithmetic operation.
3. No Validation: Continue to provide APIs to validate that all values are within bounds, but don't attempt to detect or handle overflow within kernels. This is consistent with how we handle overflow elsewhere and is perfectly well defined.
My 2 cents on this is that we only need to validate the invariants that arrow-rs itself needs in order to prevent UB. If some other software component has additional invariants, it is on that component to verify them. Provided we clearly document the invariants we uphold, I don't think this is an issue. My preference would therefore be option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere.
In my opinion, the reason for doing validation of the decimal array is that we would like to make sure all the elements in the array are within the range for the precision. For example, I call the … But as far as the Decimal data type is concerned, if I cast the …
Can you tell me where we handle the overflow? I can't quite follow your thoughts.
In all the places where we are currently doing validation?
Only if you can assert that the data in the …
No, which is why I am trying to argue for not bothering to do the same for decimals... But currently we sort of do and sort of don't, and I don't really understand what is going on.

Edit: apparently we are doing checked casting for numerics, so I don't know anymore...
Another critical path for decimal: do we need to do validation when reading Parquet data into decimal?
I have no preference, but I just don't want to do unnecessary validation which will impact the performance of reading data / casting.
We can do the strict validation, but skip it at the points where we can be sure it isn't needed. If we follow strict validation, we can make sure all the elements in the … So far I have only found three cases where we don't need to do validation: reading decimal data from a Parquet file (the schema is taken from the Parquet file metadata, so the data and metadata in the Parquet file match); casting a decimal array from a smaller range to a bigger range; and the take operation.
I know; my point is we aren't doing strict validation currently, so the optimisation is ill-formed until such a time as we are doing strict validation, if we wish to do so.
I found another thing about the …

All the data is out of the range of int8, and we will get … But for the decimal data type conversion, we will get … I think the two results are inconsistent.
Maybe we can follow the behavior of Arrow C++ and make this consistent with the C++ version?
Any conclusion for this issue? @tustvold
I thought you were going to take a look at the C++ implementation?
Yes, I think I will finish it today. Maybe @viirya is familiar with the C++ code base; do you have any opinion?
Not sure which part of decimal validation you meant? If you were talking about the cast kernel, from a quick look the C++ implementation provides an option to choose between truncating the decimal (it calls an unsafe upscale/downscale), or doing a safe rescale which checks that scaling is okay (no overflow/truncation) and that the rescaled value fits in the new precision.
I think it is a safe option to follow the C/C++ interface. 👍

I don't really have a strong opinion here other than, like @tustvold, I would like a consistent rule that we follow.
I think the idea is that the user can call the cast kernels (https://docs.rs/arrow/21.0.0/arrow/compute/kernels/cast/index.html). I suggest making Decimal the same with respect to casting behavior (as in, follow the default behavior of the other numeric types).
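For context, a minimal sketch of what that default numeric behavior looks like through the cast kernel, assuming the arrow 21-era API (`cast_with_options` with a `CastOptions { safe }` flag); the suggestion above is that decimal casts would follow the same pattern:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::cast::{cast_with_options, CastOptions};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // 300 does not fit into Int8.
    let source: ArrayRef = Arc::new(Int32Array::from(vec![1, 300, -5]));

    // `safe: true` is the default used by `cast`: values that do not fit the
    // target type should come back as nulls rather than errors.
    let lenient = cast_with_options(&source, &DataType::Int8, &CastOptions { safe: true })?;
    println!("lenient cast: {:?}", lenient);

    // With `safe: false` the kernel is expected to return an error instead.
    let strict = cast_with_options(&source, &DataType::Int8, &CastOptions { safe: false });
    println!("strict cast errored: {}", strict.is_err());

    Ok(())
}
```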
A proposal to make this handled consistently, among other things, is here: #2637 (comment)
With #2857 precision validation is explicitly opt-in. As we add support for checked arithmetic, this will be at the boundaries of the underlying i128 / i256 and not according to the precision. This is fine, as these kernels will prevent truncation / overflow resulting in data loss, and if the user wishes to ensure precision is respected they can explicitly make a call to validate it.
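A small sketch of the distinction drawn here, using plain i128 values rather than the actual kernels: checked arithmetic guards the boundary of the underlying storage type, while respecting the declared precision remains a separate, explicit check:

```rust
fn main() {
    // Checked arithmetic catches overflow of the underlying i128 storage...
    assert_eq!(i128::MAX.checked_add(1), None);

    // ...but a result that merely exceeds the declared precision is still a
    // perfectly valid i128, so a checked kernel alone would not reject it.
    let precision: u32 = 5;
    let result = 99_999_i128.checked_add(1).unwrap(); // 100_000 needs 6 digits
    assert!(result > 10_i128.pow(precision) - 1);

    // Catching that is the separate, explicitly opt-in precision validation step.
}
```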
Is there a fix for this in C# as well? I am seeing the same issue while parsing Decimals in C# and consistently getting the error …
You would need to ask on the main arrow repo, with a code reproducer. There is a good chance it isn't a bug.
Which part is this question about
Generally the approach taken by this crate is that a given `ArrayData`, and by extension `Array`, only contains valid data. For example, a StringArray is valid UTF-8 with each index at a codepoint boundary, a dictionary array only has valid indexes, etc... This allows eliding bound checks on access within kernels.

However, in order for this to be sound, it must be impossible to create invalid `ArrayData` using safe APIs. This means that safe APIs must either validate the data (e.g. `ArrayData::try_new`) or otherwise be unable to construct invalid data.
For the examples above, incorrect validation can very clearly lead to UB. The situation for decimal values is a bit more confused; in particular, I'm not really clear on what the implications of a value that exceeds the precision actually are. However, some notes:
Describe your question
My question boils down to:
The answers to this will dictate if we can just take a relaxed attitude to precision, and let users opt into validation if they care, and otherwise simply ignore it.
I tried to understand what the C++ implementation is doing, but I honestly got lost. It almost looks like it is performing floating point operations and then rounding them back, which seems surprising...
Additional context