Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP]Change precision of decimal without validation #2357

Closed

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #2313

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Aug 8, 2022
@liukun4515 liukun4515 force-pushed the remove_unnecessary_decimal_valid branch from 39a19e1 to 97e1f26 Compare August 8, 2022 04:27
@liukun4515 liukun4515 requested review from tustvold, viirya and alamb and removed request for tustvold and viirya August 8, 2022 04:28
@liukun4515
Copy link
Contributor Author

Just change the API and change some usages as the example.

two example for no validation for creating/changing the decimal:

  1. read the decimal data from parquet
  2. cast one decimal to another. TODO: other primitive type to decimal

@liukun4515
Copy link
Contributor Author

PTAL this draft, this just an example for optimizing for decimal array.

If you have any option, please feel free to let me know.

@alamb @tustvold @viirya

@liukun4515 liukun4515 changed the title Change precision of decimal without validation [WIP]Change precision of decimal without validation Aug 8, 2022
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method needs to be unsafe if it is skipping validation without knowing it is safe to do so? At least that would be consistent with the rest of the codebase where it is common to have unsafe *_unchecked methods. I suspect there are many places for decimals where we aren't following this and should be

Thoughts @viirya ?

@viirya
Copy link
Member

viirya commented Aug 8, 2022

Agreed that this should be unsafe.

I suspect there are many places for decimals where we aren't following this and should be

Do you mean that we should do this unsafe style precision changing for many places but we haven't?

Comment on lines +1283 to +1290
let output_array = match output_precision-output_scale>=input_precision - input_scale {
true => {
output_array.with_precision_and_scale(*output_precision, *output_scale, false)
}
false => {
output_array.with_precision_and_scale(*output_precision, *output_scale, true)
}
}?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

cast_decimal_to_decimal already adjusts scale so seems only precision does matter here. with_precision_and_scale also only does validation if new precision is less.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if cast decimal(5,2) to decimal(5,3).
a value of 123.45 will be convert to 123.450, but the 123450 is out of range of decimal(5,3).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still learning the logic. Here is just a nit:

Suggested change
let output_array = match output_precision-output_scale>=input_precision - input_scale {
true => {
output_array.with_precision_and_scale(*output_precision, *output_scale, false)
}
false => {
output_array.with_precision_and_scale(*output_precision, *output_scale, true)
}
}?;
let output_array = output_array.with_precision_and_scale(
*output_precision,
*output_scale,
output_precision - output_scale < input_precision - input_scale)?;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if cast decimal(5,2) to decimal(5,3). a value of 123.45 will be convert to 123.450, but the 123450 is out of range of decimal(5,3).

This out-of-range should already be caught now as with_precision_and_scale will do the validation.

But I got your point of the change here. with_precision_and_scale will do validation here all the time. But for some cases output precision >= input precision + (output scale - input scale), we can skip the validation.

I'd suggest to rewrite the condition as output precision >= input precision + (output scale - input scale).

@@ -208,7 +210,7 @@ where
))
}
}
.with_precision_and_scale(p, s)?;
.with_precision_and_scale(p, s, false)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, what if user specify a smaller precision?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the codebase and arrow_reader from parquet, the decimal type of arrow is converted from the decimal type of parquet schema/meatdata.

physical type of parquet just contains below types:

PhysicalType::BOOLEAN => Arc::new(BooleanArray::from(array_data)) as ArrayRef,
            PhysicalType::INT32 => Arc::new(Int32Array::from(array_data)) as ArrayRef,
            PhysicalType::INT64 => Arc::new(Int64Array::from(array_data)) as ArrayRef,
            PhysicalType::FLOAT => Arc::new(Float32Array::from(array_data)) as ArrayRef,
            PhysicalType::DOUBLE => Arc::new(Float64Array::from(array_data)) as ArrayRef,
            PhysicalType::INT96
            | PhysicalType::BYTE_ARRAY
            | PhysicalType::FIXED_LEN_BYTE_ARRAY

The target_type arrow type is from logical type of parquet file, so when reading from the parquet file, the data must be right or valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the data must be right or valid.

I'm not sure we can make this assumption, we validate UTF-8 strings for example...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can make this assumption, we validate UTF-8 strings for example...

Do you mean that the parquet data is may not match with its metadata or schema?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An API is unsound if you can violate invariants of a data structure, therefore potentially introducing UB, without using unsafe APIs to achieve this. As parquet IO is safe, if parquet did not validate the data on read it would be unsound.

This leads to a couple of options, and possibly more besides:

  1. Don't care about decimals exceeding the precision and don't make this an invariant
  2. We make this an invariant and validate on read
  3. We add an unsafe flag to skip validation on read

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how do we know that we don't need to do validation here? The decimal type can be specified, what if user specify a smaller precision?

Why user can specify a new data type and not the type from parquet schema? If you can specify the type there may be other error when the specified type is not compatible with the data in the parquet. Although the current interface is this with specified data type.

Copy link
Contributor Author

@liukun4515 liukun4515 Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to take a step back before undertaking any of these, as frankly I am deeply confused by what this precision argument is actually for - why arbitrarily truncate your value space?

In order to represent the decimal, some system or rule just use the siged integer to store the data, and scale/precision just as the metadata. We can get the exact value from this two parts.

For example decimal(3,1), the value of 10.1 will be stored/encoded as 101.

@tustvold
Maybe the parquet doc can help you
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L67
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

@HaoYang670
Copy link
Contributor

I suspect there are many places for decimals where we aren't following this and should be

I am not sure whether this is related to this PR, but the new method should also be unsafe: https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

@liukun4515
Copy link
Contributor Author

I suspect there are many places for decimals where we aren't following this and should be

I am not sure whether this is related to this PR, but the new method should also be unsafe: https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

Not same issue.

There are many unnecessary precision validation for decimal array, such as reading parquet to arrow, casting smaller precision decimal arrow array to another bigger precision decimal arrow type.

The validation of decimal array is heavy.

/// Returns an Error if:
/// 1. `precision` is larger than [`Self::MAX_PRECISION`]
/// 2. `scale` is larger than [`Self::MAX_SCALE`];
/// 3. `scale` is > `precision`
fn with_precision_and_scale(self, precision: usize, scale: usize) -> Result<U>
/// 4. `need_validation` is `true`, but some values are out of ranges/bounds
fn with_precision_and_scale(self, precision: usize, scale: usize, need_validation: bool) -> Result<U>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need_validation is somewhat confused to me.
Do we need to check the precision, or check the scale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just precision, because the decimal will to stored as a bigint without fractional part and just need to check the bounds/ranges for the max/min for the precision.
The max/min value is determined by precision.
For example decimal(4,n), the max bigint is 9999 and the min is -9999, we should do validation for the input decimal or value in the decimal array if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I prefer renaming it and add more docs.

@tustvold
Copy link
Contributor

tustvold commented Aug 8, 2022

Thank you for working on this, however, I have a feeling that there is something fundamentally incorrect, or at the very least inconsistent, with the way validation for Decimal types is currently handled that warrants a more holistic review. In particular there are a number of APIs that should be unsafe, that aren't, the construction of BasicDecimal has unnecessary bounds checking, etc...

I have asked on the ASF slack for clarification of what the purpose of the precision argument actually is, in particular what are the expected overflow/underflow characteristics. Once I have some clarity on what the correct behaviour is, I'll take a stab at fixing/filing the issues that arise.

I'm sorry that this does mean there will be a slight delay, but I think it is better that we fix this properly, as I at least am extremely confused by what the current code does...

// do validation.
let output_array = match output_precision-output_scale>=input_precision - input_scale {
true => {
output_array.with_precision_and_scale(*output_precision, *output_scale, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does with_precision_and_scale check if you set need_validation to false?
It only checks:
1. output precision <=? max precision
2. output scale <=? max scale
3. output precision >= output scale I guess.

If this is the truth, why do we check these things after we doing the casting? We should check the output precision and scale at the beginning of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does with_precision_and_scale check if you set need_validation to false? It only checks: 1. output precision <=? max precision 2. output scale <=? max scale 3. output precision >= output scale I guess.

If this is the truth, why do we check these things after we doing the casting? We should check the output precision and scale at the beginning of this method.

You miss one point of the with_precision_and_scale which will reset/set the DataType for this array.

The decimaarray are created from below case

array
            .iter()
            .map(|v| v.map(|v| v.as_i128() * mul))
            .collect::<Decimal128Array>()

, and need to reset/set the precision/scale for the result decimal array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You miss one point of the with_precision_and_scale which will reset/set the DataType for this array.

Could you please share the code link, I don't find it.

Copy link
Contributor Author

@liukun4515 liukun4515 Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoYang670
create array by reading data from parquet

ArrowType::Decimal128(p, s) => {

create a new decimal array from casting

fn cast_decimal_to_decimal(

In cast case, if the target decimal type has larger range, we can ignore the validation.
For example, cast decimal(10,3) to decimal(12,3) don't need to do the validation, even int32 to decimal(12,0) also don't need to do validation.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2022

I am not sure whether this is related to this PR, but the new method should also be unsafe:
https://github.com/apache/arrow-rs/blob/master/arrow/src/util/decimal.rs#L81-L86

I would recommend ensuringnew() created valid DecimalArrays rather than marking it unsafe. Possibly related to this PR: #2360

@alamb
Copy link
Contributor

alamb commented Aug 8, 2022

Possibly also related: #2362

@alamb
Copy link
Contributor

alamb commented Aug 17, 2022

What is the status of this PR?

@liukun4515
Copy link
Contributor Author

What is the status of this PR?

This pr will be closed.
I am ready to file some small sub-pr to push this forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

optimize decimal: reduce validation when construct the decimal array or cast to the decimal array
5 participants