[WIP] Change precision of decimal without validation #2357
Conversation
This just changes the API and updates some usages as examples. Two examples where no validation is needed when creating/changing a decimal:
I think this method needs to be unsafe if it is skipping validation without knowing it is safe to do so? At least that would be consistent with the rest of the codebase, where it is common to have unsafe *_unchecked methods. I suspect there are many places for decimals where we aren't following this and should be.
Thoughts @viirya?
Agreed that this should be unsafe.
Do you mean that we should use this unsafe-style precision changing in many places where we currently don't?
```rust
let output_array = match output_precision - output_scale >= input_precision - input_scale {
    true => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
    }
    false => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, true)
    }
}?;
```
Do we need this? cast_decimal_to_decimal already adjusts the scale, so it seems only precision matters here. with_precision_and_scale also only does validation if the new precision is smaller.
If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the unscaled integer 123450 is out of range for decimal(5,3).
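A minimal self-contained sketch of that arithmetic (plain Rust, not the arrow-rs kernel itself), assuming the unscaled-integer representation discussed later in this thread:

```rust
fn main() {
    // 123.45 at scale 2 is stored as the unscaled integer 12345
    let unscaled: i128 = 12345;
    // rescaling from scale 2 to scale 3 multiplies by 10^(3 - 2)
    let rescaled = unscaled * 10i128.pow(3 - 2); // 123450
    // precision 5 allows unscaled values up to 10^5 - 1 = 99999
    let max_for_precision_5 = 10i128.pow(5) - 1;
    assert!(rescaled > max_for_precision_5); // 123450 is out of range for decimal(5,3)
}
```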
I am still learning the logic. Here is just a nit:
```diff
-let output_array = match output_precision - output_scale >= input_precision - input_scale {
-    true => {
-        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
-    }
-    false => {
-        output_array.with_precision_and_scale(*output_precision, *output_scale, true)
-    }
-}?;
+let output_array = output_array.with_precision_and_scale(
+    *output_precision,
+    *output_scale,
+    output_precision - output_scale < input_precision - input_scale,
+)?;
```
> If we cast decimal(5,2) to decimal(5,3), a value of 123.45 will be converted to 123.450, but the unscaled integer 123450 is out of range for decimal(5,3).

This out-of-range case should already be caught now, as with_precision_and_scale will do the validation.

But I get your point about the change here: with_precision_and_scale currently does the validation all the time, while in some cases, when output precision >= input precision + (output scale - input scale), we can skip it.

I'd suggest rewriting the condition as output precision >= input precision + (output scale - input scale).
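The two spellings are the same inequality with output_scale moved to the other side; a quick brute-force check (a standalone sketch over hypothetical small ranges) confirms the equivalence:

```rust
fn main() {
    // exhaustively compare the original and rewritten conditions over small
    // (precision, scale) combinations; they always agree
    for ip in 1..=9i32 {
        for is in 0..=ip {
            for op in 1..=9i32 {
                for os in 0..=op {
                    let original = op - os >= ip - is;
                    let rewritten = op >= ip + (os - is);
                    assert_eq!(original, rewritten);
                }
            }
        }
    }
    println!("the two conditions agree on all tested cases");
}
```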
```diff
@@ -208,7 +210,7 @@ where
         ))
     }
 }
-.with_precision_and_scale(p, s)?;
+.with_precision_and_scale(p, s, false)?;
```
Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, so what if the user specifies a smaller precision?
From the codebase and the parquet arrow_reader, the arrow decimal type is converted from the decimal type in the parquet schema/metadata.

The parquet physical type only covers the types below:

```rust
PhysicalType::BOOLEAN => Arc::new(BooleanArray::from(array_data)) as ArrayRef,
PhysicalType::INT32 => Arc::new(Int32Array::from(array_data)) as ArrayRef,
PhysicalType::INT64 => Arc::new(Int64Array::from(array_data)) as ArrayRef,
PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)) as ArrayRef,
PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)) as ArrayRef,
PhysicalType::INT96
| PhysicalType::BYTE_ARRAY
| PhysicalType::FIXED_LEN_BYTE_ARRAY
```

The target_type arrow type comes from the logical type of the parquet file, so when reading from the parquet file, the data must be right or valid.
> the data must be right or valid.

I'm not sure we can make this assumption; we validate UTF-8 strings, for example...
> I'm not sure we can make this assumption; we validate UTF-8 strings, for example...

Do you mean that the parquet data may not match its metadata or schema?
An API is unsound if you can violate invariants of a data structure, therefore potentially introducing UB, without using unsafe APIs to achieve this. As parquet IO is safe, if parquet did not validate the data on read it would be unsound.
This leads to a couple of options, and possibly more besides:
- Don't care about decimals exceeding the precision and don't make this an invariant
- We make this an invariant and validate on read
- We add an unsafe flag to skip validation on read
I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?
> Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, so what if the user specifies a smaller precision?

Why can the user specify a new data type rather than the type from the parquet schema? If you can specify the type, there may be other errors when the specified type is not compatible with the data in the parquet file, although the current interface does take a specified data type.
> I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

To represent a decimal, some systems or rules just use a signed integer to store the data, with scale/precision as the metadata; the exact value can be recovered from these two parts. For example, with decimal(3,1), the value 10.1 will be stored/encoded as 101 (see the sketch after the links below).

@tustvold Maybe the parquet docs can help you:
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L67
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
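A small sketch of this representation (plain Rust, simplified: it ignores some negative-value edge cases and is not the arrow-rs implementation):

```rust
// a decimal value is an unscaled signed integer plus (precision, scale) metadata
fn decimal_to_string(unscaled: i128, scale: u32) -> String {
    let pow = 10i128.pow(scale);
    format!(
        "{}.{:0width$}",
        unscaled / pow,
        (unscaled % pow).abs(),
        width = scale as usize
    )
}

fn main() {
    // decimal(3,1): the value 10.1 is stored as the signed integer 101
    assert_eq!(decimal_to_string(101, 1), "10.1");
}
```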
I am not sure whether this is related to this PR, but the …

Not the same issue. There are many unnecessary precision validations for decimal arrays, such as when reading parquet into arrow, or when casting a smaller-precision decimal array to a larger-precision decimal type. The validation of a decimal array is heavy.
```diff
 /// Returns an Error if:
 /// 1. `precision` is larger than [`Self::MAX_PRECISION`]
 /// 2. `scale` is larger than [`Self::MAX_SCALE`];
 /// 3. `scale` is > `precision`
-fn with_precision_and_scale(self, precision: usize, scale: usize) -> Result<U>
+/// 4. `need_validation` is `true`, but some values are out of ranges/bounds
+fn with_precision_and_scale(self, precision: usize, scale: usize, need_validation: bool) -> Result<U>
```
need_validation is somewhat confusing to me. Do we need to check the precision, or check the scale?
Just precision, because the decimal will be stored as a big integer without a fractional part, and we just need to check the bounds against the max/min for the precision. The max/min value is determined by the precision. For example, for decimal(4,n), the max unscaled integer is 9999 and the min is -9999; we should validate the input decimal, or the values in the decimal array, if needed.
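A minimal sketch of that bounds check (plain Rust, not the actual arrow-rs validation code):

```rust
// the representable range depends only on precision, not scale
fn bounds_for_precision(precision: u32) -> (i128, i128) {
    let max = 10i128.pow(precision) - 1; // e.g. 9999 for precision 4
    (-max, max)
}

fn is_in_bounds(unscaled: i128, precision: u32) -> bool {
    let (min, max) = bounds_for_precision(precision);
    (min..=max).contains(&unscaled)
}

fn main() {
    assert!(is_in_bounds(9999, 4));   // max for decimal(4,n)
    assert!(is_in_bounds(-9999, 4));  // min for decimal(4,n)
    assert!(!is_in_bounds(10000, 4)); // out of range
}
```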
Then I prefer renaming it and adding more docs.
Thank you for working on this. However, I have a feeling that there is something fundamentally incorrect, or at the very least inconsistent, with the way validation for Decimal types is currently handled that warrants a more holistic review. In particular there are a number of APIs that should be unsafe but aren't, and the construction of …

I have asked on the ASF slack for clarification of what the purpose of the precision argument actually is, in particular what the expected overflow/underflow characteristics are. Once I have some clarity on what the correct behaviour is, I'll take a stab at fixing/filing the issues that arise.

I'm sorry that this does mean there will be a slight delay, but I think it is better that we fix this properly, as I at least am extremely confused by what the current code does...
```rust
// do validation.
let output_array = match output_precision - output_scale >= input_precision - input_scale {
    true => {
        output_array.with_precision_and_scale(*output_precision, *output_scale, false)
```
What does with_precision_and_scale check if you set need_validation to false? It only checks:

1. output precision <= max precision?
2. output scale <= max scale?
3. output precision >= output scale, I guess.

If this is the truth, why do we check these things after doing the cast? We should check the output precision and scale at the beginning of this method.
> What does with_precision_and_scale check if you set need_validation to false? It only checks: 1. output precision <= max precision? 2. output scale <= max scale? 3. output precision >= output scale, I guess. If this is the truth, why do we check these things after doing the cast? We should check the output precision and scale at the beginning of this method.

You miss one point of with_precision_and_scale: it also resets/sets the DataType for this array. The decimal array is created from a case like

```rust
array
    .iter()
    .map(|v| v.map(|v| v.as_i128() * mul))
    .collect::<Decimal128Array>()
```

and we need to reset/set the precision/scale for the resulting decimal array.
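To make the flow concrete, here is a self-contained sketch using plain-Rust stand-ins, not the actual arrow-rs types; the three-argument with_precision_and_scale mirrors the signature proposed in this PR. Collecting produces values without the right type metadata, and the final call both validates (optionally) and sets it:

```rust
struct DecimalArray {
    values: Vec<i128>,
    precision: u32,
    scale: u32,
}

impl DecimalArray {
    // mirrors the PR's proposed API: optionally validate, then (re)set the
    // precision/scale metadata (the arrow DataType, in the real code)
    fn with_precision_and_scale(
        mut self,
        precision: u32,
        scale: u32,
        need_validation: bool,
    ) -> Result<Self, String> {
        if need_validation {
            let max = 10i128.pow(precision) - 1;
            if self.values.iter().any(|v| *v < -max || *v > max) {
                return Err(format!("value out of range for decimal({precision},{scale})"));
            }
        }
        self.precision = precision;
        self.scale = scale;
        Ok(self)
    }
}

fn main() -> Result<(), String> {
    // rescale decimal(10,3) values to scale 4 (multiply by 10), then set the
    // output type; the cast widens, so validation is skipped (`false`)
    let input = DecimalArray { values: vec![12345, -6789], precision: 10, scale: 3 };
    let rescaled: Vec<i128> = input.values.iter().map(|v| v * 10).collect();
    let output = DecimalArray { values: rescaled, precision: 0, scale: 0 }
        .with_precision_and_scale(11, 4, false)?;
    assert_eq!((output.precision, output.scale), (11, 4));
    Ok(())
}
```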
> You miss one point of with_precision_and_scale: it also resets/sets the DataType for this array.

Could you please share the code link? I can't find it.
@HaoYang670

Creating an array by reading data from parquet:

```rust
ArrowType::Decimal128(p, s) => {
```

Creating a new decimal array from casting (arrow-rs/arrow/src/compute/kernels/cast.rs, line 1255 at 9a630a1):

```rust
fn cast_decimal_to_decimal(
```

In the cast case, if the target decimal type has a larger range, we can skip the validation. For example, casting decimal(10,3) to decimal(12,3) doesn't need validation; even int32 to decimal(12,0) doesn't need validation.
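A hypothetical helper (not part of arrow-rs) expressing this rule, with the examples above as checks:

```rust
// validation is only needed when the output type leaves fewer integer digits
// than the input type can carry
fn cast_needs_validation(in_p: i32, in_s: i32, out_p: i32, out_s: i32) -> bool {
    out_p - out_s < in_p - in_s
}

fn main() {
    assert!(!cast_needs_validation(10, 3, 12, 3)); // decimal(10,3) -> decimal(12,3)
    assert!(!cast_needs_validation(10, 0, 12, 0)); // int32 range -> decimal(12,0)
    assert!(cast_needs_validation(5, 2, 5, 3));    // decimal(5,2) -> decimal(5,3)
}
```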
I would recommend ensuring …

Possibly also related: #2362

What is the status of this PR?

This PR will be closed.
Which issue does this PR close?
Closes #2313
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?