Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support decimal statistic for row group prune #2966

Merged
merged 2 commits into from
Jul 27, 2022

Conversation

liukun4515
Copy link
Contributor

@liukun4515 liukun4515 commented Jul 26, 2022

Which issue does this PR close?

Closes #2962

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 26, 2022
if !$column_statistics.has_min_max_set() {
return None;
}
match $column_statistics {
ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))),
ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))),
ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))),
ParquetStatistics::Int32(s) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value stored in the parquet file is unscaled number.
For example parquet decimal(9,2), the value of 1.23 will stored as 123 in the parquet file.

fn parquet_to_arrow_field(parquet_column: &ColumnDescriptor) -> Option<DataType> {
let type_ptr = parquet_column.self_type_ptr();
match type_ptr.get_basic_info().logical_type() {
// just handle the decimal type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I just handle the decimal case, and don't change the logic for other data types.
Maybe there are bugs for other data types, I will add tests to check them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could rename the function to parquet_to_arrow_decimal_field or something to make it more clear that this only applies to decimal types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the function name to parquet_to_arrow_decimal_type

@liukun4515 liukun4515 force-pushed the support_row_group_decimal_prune branch from 0e172a2 to 8fb2871 Compare July 26, 2022 12:53
@liukun4515 liukun4515 requested review from alamb, tustvold and viirya July 26, 2022 12:57
@codecov-commenter
Copy link

codecov-commenter commented Jul 26, 2022

Codecov Report

Merging #2966 (07bfc39) into master (7b0f2f8) will increase coverage by 0.11%.
The diff coverage is 98.15%.

❗ Current head 07bfc39 differs from pull request most recent head 3378e02. Consider uploading reports for the commit 3378e02 to get more accurate results

@@            Coverage Diff             @@
##           master    #2966      +/-   ##
==========================================
+ Coverage   85.62%   85.74%   +0.11%     
==========================================
  Files         279      280       +1     
  Lines       50965    51433     +468     
==========================================
+ Hits        43641    44101     +460     
- Misses       7324     7332       +8     
Impacted Files Coverage Δ
datafusion/core/tests/sql/joins.rs 99.31% <ø> (ø)
...ion/optimizer/src/rewrite_disjunctive_predicate.rs 96.75% <96.75%> (ø)
...sion/core/src/physical_plan/file_format/parquet.rs 95.44% <97.70%> (+0.43%) ⬆️
datafusion/optimizer/src/simplify_expressions.rs 84.02% <99.15%> (+2.02%) ⬆️
...afusion/core/src/datasource/file_format/parquet.rs 85.89% <100.00%> (+0.87%) ⬆️
datafusion/core/src/execution/context.rs 78.05% <100.00%> (+0.02%) ⬆️
datafusion/core/src/physical_optimizer/pruning.rs 94.17% <100.00%> (+0.38%) ⬆️
datafusion/core/tests/sql/predicates.rs 100.00% <100.00%> (ø)
datafusion/optimizer/src/reduce_outer_join.rs 98.79% <0.00%> (-0.61%) ⬇️
datafusion/common/src/scalar.rs 84.80% <0.00%> (-0.07%) ⬇️
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

@liukun4515 liukun4515 added the bug Something isn't working label Jul 26, 2022
// Convert the bytes array to i128.
// The endian of the input bytes array must be big-endian.
// Copy from the arrow-rs
fn from_bytes_to_i128(b: &[u8]) -> i128 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to simply call Decimal128::new() here? https://docs.rs/arrow/19.0.0/arrow/util/decimal/struct.Decimal128.html#method.new

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decimal128::new() can't be apply to this, because the length of [u8] must be 16 and with default little-endian.
The array of [u8] decoded from parquet may be not align with 16 bytes, and the endian of it is big-endian.
About the definition of parquet decimal data, you can read the format spec

// 96 bit ints not supported
ParquetStatistics::Int96(_) => None,
ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))),
ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))),
ParquetStatistics::ByteArray(s) => {
// TODO support decimal type for byte array type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we track this in a ticket? Or maybe even implement it in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracked by #2970

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO support decimal type for byte array type
// TODO support decimal type for byte array type
// https://github.com/apache/arrow-datafusion/issues/2970

Comment on lines 527 to 529
/// Copy from arrow-rs
/// TODO: consolidate code with arrow-rs
/// TODO: change this API public in the arrow-rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please file a ticket in arrow-rs describing the change needed and add a reference to the new ticket here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After go through the codebase, In the datafusion we just need to consider the decimal data type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove the TODO and don't need a ticket to track it.

fn parquet_to_arrow_field(parquet_column: &ColumnDescriptor) -> Option<DataType> {
let type_ptr = parquet_column.self_type_ptr();
match type_ptr.get_basic_info().logical_type() {
// just handle the decimal type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could rename the function to parquet_to_arrow_decimal_field or something to make it more clear that this only applies to decimal types

vec![0]
);

// INT32: c1 > 5, but parquet decimal type has different precision or scale to arrow decimal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is a good case to test

Though I suggest you use something like decimal(5,0) where the difference in decimal types will result in significant errors rather than simply truncation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the arrow data type from decimal(9,2) to decimal(5,2).
This changes will bring a different precision and scale when compare/filter data.

PruningPredicate::try_new(expr, Arc::new(schema)).unwrap();
let rgm1 = get_row_group_meta_data(
&schema_descr,
// [100, 600]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't 100 in the parquet file mean 100 (rather than 1.00) if the type is Decimal(9, 0)?

Copy link
Contributor Author

@liukun4515 liukun4515 Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.
This is the parquet schema

        let schema_descr = get_test_schema_descr(vec![(
            "c1",
            PhysicalType::INT32,
            Some(LogicalType::Decimal {
                scale: 0,
                precision: 9,
            }),
            Some(9),
            Some(0),
            None,
        )]);

the schema of this parquet column is decimal(9,0), so the actual value of Some(100) is 100.

The parquet just store the unscaled number, this is define by the parquet spec format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the schema of this parquet column is decimal(9,0), so the actual value of Some(100) is 100.

Got it -- thank you

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liukun4515

PruningPredicate::try_new(expr, Arc::new(schema)).unwrap();
let rgm1 = get_row_group_meta_data(
&schema_descr,
// [100, 600]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the schema of this parquet column is decimal(9,0), so the actual value of Some(100) is 100.

Got it -- thank you

@alamb alamb merged commit c345f6d into apache:master Jul 27, 2022
@ursabot
Copy link

ursabot commented Jul 27, 2022

Benchmark runs are scheduled for baseline = 16f8934 and contender = c345f6d. c345f6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't filter rowgroup for parquet prune for some data type
4 participants