Decimal Precision Validation #2387
Comments
cc @HaoYang670 @viirya @liukun4515 This is an excellent question -- "why do we need to validate Decimal Precision at all" -- it will likely drive design decisions such as what @HaoYang670 raised on #2362
As far as I know, the precision is actually the maximum number of decimal digits a value may have.
Because the Decimal type is somewhat a mixture of a static type (128 bits or 256 bits) and a dynamic type (precision and scale), we have to do validation at runtime to check for overflow.
I guess the behavior is undefined. Because …
Given we don't care about this for other types, because it has serious performance implications, why would we care about it for decimals? I guess my question can be phrased as: given that decimal overflow cannot lead to undefined behaviour, why do we need to check precision? A somewhat related point is that in Rust, signed integer overflow is not undefined behaviour and is explicitly defined to wrap as two's complement (in release builds), because checking at runtime is prohibitively expensive. I don't see an obvious reason we should treat decimals any differently; the overflow behaviour is perfectly well defined...
Though they are definitely not ideal 😆 But if someone cares we can add the Rust-style `checked_add()` etc. kernels and by default leave it unchecked 🤔
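For reference, a minimal sketch of the distinction being discussed, using plain i128 values rather than any arrow kernel: wrapping overflow is well defined, and the checked variants are the opt-in way to detect it:

```rust
fn main() {
    // Overflow with the wrapping methods is well defined: two's complement wrap-around.
    assert_eq!(i128::MAX.wrapping_add(1), i128::MIN);

    // The opt-in "checked" style reports the overflow to the caller instead.
    assert_eq!(i128::MAX.checked_add(1), None);
    assert_eq!(1_i128.checked_add(2), Some(3));
}
```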
I think that the precision for decimal specifies the representation range, like other types do. It's useful so we can know at which moment an overflow can happen and what range of values can be used with the type without overflow. As we don't have a native decimal type in Rust, it is internally represented by bytes with a maximum length for the maximum representation range. It could easily be given a value over its precision. For example, we can create a 128-bit decimal with precision 5 by giving it a value which is representable only by precision 10 (i.e., 10 digits). For native types you might get an overflow value, but in this case the 10-digit value still fits in the 128-bit bytes, so we don't have it as an overflow value there, I think.

I think that is why things become complicated. We may need to validate decimal values at some points. For now we throw some errors for invalid (overflow) values. Instead, I think it might also make sense to change the invalid values to an overflow representation to simulate overflow. Although I think for some systems, overflow will throw an error anyway. But as mentioned:
I guess that decimal overflow should be no different. I think that interoperability is a concern, because I'm not sure how other systems will interpret the decimal values from this crate. Assume we don't do any validation on decimals: if another system just takes the values without validation, I think the behavior might be unexpected. Would it be an overflow in that system, or would it truncate the value?
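A minimal sketch of the situation described above, using plain i128 arithmetic rather than any arrow type: a value that needs 10 digits still fits comfortably in the 128-bit storage of a precision-5 decimal, so only an explicit comparison against the precision bound can detect that it is out of range:

```rust
fn main() {
    let precision: u32 = 5;
    // Largest magnitude representable with 5 decimal digits.
    let max_for_precision = 10_i128.pow(precision) - 1; // 99_999

    // A 10-digit value, representable only with precision 10.
    let value: i128 = 1_234_567_890;

    // The 128-bit storage holds it without any integer-level overflow...
    assert!(value.abs() < i128::MAX);

    // ...so detecting the precision "overflow" requires an explicit range check.
    assert!(value.abs() > max_for_precision);
}
```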
Unexpected, perhaps, but not undefined. I'm not saying we don't provide methods to validate decimals, just that we make this explicitly opt-in?
I don't have a strong opinion on this question, for what it is worth. Ensuring that array data has reasonable values for decimal seems ok to me (to fail fast rather than fail at some much later point), but if there is some compelling reason not to validate I could be convinced that way too.
I'm really struggling to review the decimal array work without an answer to this, as the current approach is wildly inconsistent to the point of being effectively meaningless... As it isn't consistently enforced, there is no reliable way to elide validation as you cannot assume validation has actually been performed
The result is that it is perfectly possible to construct a DecimalArray with overflowed values using safe APIs. Consequently it is unclear how you can then meaningfully elide validation, or reason about whether an operation will overflow, as you can't guarantee the input data wasn't already overflowed... It is also unclear what the meaning of validation is... For example, if you use …

I can see 3 possible paths forward:

1. Strict Validation: This is the most "correct" interpretation, but it has significant performance drawbacks, and is much stronger than the guarantees we provide for other arithmetic types.
2. Loose Validation: This would allow opt-in overflow detection, but would require the user to perform validation after every arithmetic operation.
3. No Validation: Continue to provide APIs to validate that all values are within bounds, but don't attempt to detect or handle overflow within kernels. This is consistent with how we handle overflow elsewhere and is perfectly well defined.
My 2 cents on this is that we only need to validate the invariants that arrow-rs itself needs in order to prevent UB. If some other software component has additional invariants, it is on that component to verify them. Provided we clearly document the invariants we uphold, I don't think this is an issue. My preference would therefore be option 3, as it is the simplest and fastest to implement, and is consistent with how we handle overflow elsewhere.
In my opinion, the reason for doing validation of the decimal array is that we would like to make sure all the elements in the array are within the range for the precision. For example, I call the … But as far as the Decimal data type is concerned, if I cast the …
Can you tell me where we handle the overflow? I can't quite follow your thoughts.
In all the places where we are currently doing validation?
Only if you can assert that the data in the …
No, which is why I am trying to argue for not bothering to do the same for decimals... But currently we sort of do and sort of don't, and I don't really understand what is going on.

Edit: apparently we are doing checked casting for numerics, so I don't know anymore...
Another critical path for decimal: do we need to do validation when reading Parquet data into decimal?
I have no preference, but I just don't want to do unnecessary validation which will impact the performance of reading data / casting.
We can do the strict validation, but skip it at the points where we can be sure it isn't needed. If we follow strict validation, we can make sure all the elements in the … So far I have only found three cases where we don't need to do validation: reading decimal data from a Parquet file (the schema is taken from the Parquet file metadata, so the data and metadata in the Parquet file match); casting a decimal array from a smaller range to a bigger range; and the take operation.
I know; my point is we aren't doing strict validation currently, so the optimisation is ill-formed until such a time as we are doing strict validation, if we wish to do so.
I found another thing about the …

All the data is out of the range of int8, and we will get … But for the decimal data type conversion, we will get … I think the two results are inconsistent.
Maybe we can follow the behavior of Arrow C++ and make this consistent with the C++ version?
Any conclusion for this issue? @tustvold
I thought you were going to take a look at the C++ implementation?
Yes, I think I will finish it today. Maybe @viirya is familiar with the C++ code base; do you have any opinion?
Not sure which part of decimal validation you meant? If you were talking about the cast kernel, from a quick look the C++ implementation provides an option to choose between truncating the decimal (it calls an unsafe upscale/downscale), or doing a safe rescale which checks that scaling is okay (no overflow/truncation) and that the rescaled value fits in the new precision.
I think it is a safe option to follow the C/C++ interface. 👍

I don't really have a strong opinion here other than, like @tustvold, I would like a consistent rule that we follow.
I think the idea is that the user can call the cast kernels (https://docs.rs/arrow/21.0.0/arrow/compute/kernels/cast/index.html). I suggest making Decimal the same with respect to casting behavior (as in, follow the default behavior of the other numeric types).
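For context, a minimal sketch of what that default numeric behavior looks like through the cast kernel, assuming the arrow 21-era API (`cast_with_options` with a `CastOptions { safe }` flag); the suggestion above is that decimal casts would follow the same pattern:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::cast::{cast_with_options, CastOptions};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // 300 does not fit into Int8.
    let source: ArrayRef = Arc::new(Int32Array::from(vec![1, 300, -5]));

    // `safe: true` is the default used by `cast`: values that do not fit the
    // target type should come back as nulls rather than errors.
    let lenient = cast_with_options(&source, &DataType::Int8, &CastOptions { safe: true })?;
    println!("lenient cast: {:?}", lenient);

    // With `safe: false` the kernel is expected to return an error instead.
    let strict = cast_with_options(&source, &DataType::Int8, &CastOptions { safe: false });
    println!("strict cast errored: {}", strict.is_err());

    Ok(())
}
```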
A proposal to make this handled consistently, among other things, is here: #2637 (comment)
With #2857 precision validation is explicitly opt-in. As we add support for checked arithmetic, this will be at the boundaries of the underlying i128 / i256 and not according to the precision. This is fine, as these kernels will prevent truncation / overflow resulting in data loss, and if the user wishes to ensure precision is respected they can explicitly make a call to validate it.
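A small sketch of the distinction drawn here, using plain i128 values rather than the actual kernels: checked arithmetic guards the boundary of the underlying storage type, while respecting the declared precision remains a separate, explicit check:

```rust
fn main() {
    // Checked arithmetic catches overflow of the underlying i128 storage...
    assert_eq!(i128::MAX.checked_add(1), None);

    // ...but a result that merely exceeds the declared precision is still a
    // perfectly valid i128, so a checked kernel alone would not reject it.
    let precision: u32 = 5;
    let result = 99_999_i128.checked_add(1).unwrap(); // 100_000 needs 6 digits
    assert!(result > 10_i128.pow(precision) - 1);

    // Catching that is the separate, explicitly opt-in precision validation step.
}
```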
Is there a fix for this in C# as well? I am seeing the same issue while parsing Decimals in C# and consistently getting the error …
You would need to ask on the main arrow repo, with a code reproducer. There is a good chance it isn't a bug.
Which part is this question about
Generally the approach taken by this crate is that a given `ArrayData`, and by extension `Array`, only contains valid data. For example, a StringArray is valid UTF-8 with each index at a codepoint boundary, a dictionary array only has valid indexes, etc... This allows eliding bound checks on access within kernels.

However, in order for this to be sound, it must be impossible to create invalid `ArrayData` using safe APIs. This means that safe APIs must either validate the data (e.g. `ArrayData::try_new`) or otherwise be unable to construct invalid data.
For the examples above, incorrect validation can very clearly lead to UB. The situation for decimal values is a bit more confused; in particular, I'm not really clear on what the implications of a value that exceeds the precision actually are. However, some notes:
Describe your question
My question boils down to:
The answers to this will dictate if we can just take a relaxed attitude to precision, and let users opt into validation if they care, and otherwise simply ignore it.
I tried to understand what the C++ implementation is doing, but I honestly got lost. It almost looks like it is performing floating point operations and then rounding them back, which seems surprising...
Additional context