-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: support decimal statistic for row group prune #2966
fix: support decimal statistic for row group prune #2966
Conversation
if !$column_statistics.has_min_max_set() { | ||
return None; | ||
} | ||
match $column_statistics { | ||
ParquetStatistics::Boolean(s) => Some(ScalarValue::Boolean(Some(*s.$func()))), | ||
ParquetStatistics::Int32(s) => Some(ScalarValue::Int32(Some(*s.$func()))), | ||
ParquetStatistics::Int64(s) => Some(ScalarValue::Int64(Some(*s.$func()))), | ||
ParquetStatistics::Int32(s) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value stored in the parquet file is unscaled number.
For example parquet decimal(9,2), the value of 1.23 will stored as 123 in the parquet file.
fn parquet_to_arrow_field(parquet_column: &ColumnDescriptor) -> Option<DataType> { | ||
let type_ptr = parquet_column.self_type_ptr(); | ||
match type_ptr.get_basic_info().logical_type() { | ||
// just handle the decimal type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I just handle the decimal case, and don't change the logic for other data types.
Maybe there are bugs for other data types, I will add tests to check them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could rename the function to parquet_to_arrow_decimal_field
or something to make it more clear that this only applies to decimal types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the function name to parquet_to_arrow_decimal_type
0e172a2
to
8fb2871
Compare
Codecov Report
@@ Coverage Diff @@
## master #2966 +/- ##
==========================================
+ Coverage 85.62% 85.74% +0.11%
==========================================
Files 279 280 +1
Lines 50965 51433 +468
==========================================
+ Hits 43641 44101 +460
- Misses 7324 7332 +8
Help us with your feedback. Take ten seconds to tell us how you rate us. |
// Convert the bytes array to i128. | ||
// The endian of the input bytes array must be big-endian. | ||
// Copy from the arrow-rs | ||
fn from_bytes_to_i128(b: &[u8]) -> i128 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to simply call Decimal128::new()
here? https://docs.rs/arrow/19.0.0/arrow/util/decimal/struct.Decimal128.html#method.new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decimal128::new()
can't be apply to this, because the length of [u8] must be 16 and with default little-endian.
The array of [u8] decoded from parquet may be not align with 16 bytes, and the endian of it is big-endian.
About the definition of parquet decimal data, you can read the format spec
// 96 bit ints not supported | ||
ParquetStatistics::Int96(_) => None, | ||
ParquetStatistics::Float(s) => Some(ScalarValue::Float32(Some(*s.$func()))), | ||
ParquetStatistics::Double(s) => Some(ScalarValue::Float64(Some(*s.$func()))), | ||
ParquetStatistics::ByteArray(s) => { | ||
// TODO support decimal type for byte array type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we track this in a ticket? Or maybe even implement it in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tracked by #2970
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO support decimal type for byte array type | |
// TODO support decimal type for byte array type | |
// https://github.com/apache/arrow-datafusion/issues/2970 |
/// Copy from arrow-rs | ||
/// TODO: consolidate code with arrow-rs | ||
/// TODO: change this API public in the arrow-rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please file a ticket in arrow-rs describing the change needed and add a reference to the new ticket here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After go through the codebase, In the datafusion we just need to consider the decimal data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remove the TODO
and don't need a ticket to track it.
fn parquet_to_arrow_field(parquet_column: &ColumnDescriptor) -> Option<DataType> { | ||
let type_ptr = parquet_column.self_type_ptr(); | ||
match type_ptr.get_basic_info().logical_type() { | ||
// just handle the decimal type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could rename the function to parquet_to_arrow_decimal_field
or something to make it more clear that this only applies to decimal types
vec![0] | ||
); | ||
|
||
// INT32: c1 > 5, but parquet decimal type has different precision or scale to arrow decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is a good case to test
Though I suggest you use something like decimal(5,0)
where the difference in decimal types will result in significant errors rather than simply truncation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change the arrow data type from decimal(9,2)
to decimal(5,2)
.
This changes will bring a different precision and scale when compare/filter data.
PruningPredicate::try_new(expr, Arc::new(schema)).unwrap(); | ||
let rgm1 = get_row_group_meta_data( | ||
&schema_descr, | ||
// [100, 600] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't 100
in the parquet file mean 100
(rather than 1.00
) if the type is Decimal(9, 0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
This is the parquet schema
let schema_descr = get_test_schema_descr(vec![(
"c1",
PhysicalType::INT32,
Some(LogicalType::Decimal {
scale: 0,
precision: 9,
}),
Some(9),
Some(0),
None,
)]);
the schema of this parquet column is decimal(9,0), so the actual value of Some(100)
is 100.
The parquet just store the unscaled number, this is define by the parquet spec format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the schema of this parquet column is decimal(9,0), so the actual value of Some(100) is 100.
Got it -- thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liukun4515
PruningPredicate::try_new(expr, Arc::new(schema)).unwrap(); | ||
let rgm1 = get_row_group_meta_data( | ||
&schema_descr, | ||
// [100, 600] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the schema of this parquet column is decimal(9,0), so the actual value of Some(100) is 100.
Got it -- thank you
Benchmark runs are scheduled for baseline = 16f8934 and contender = c345f6d. c345f6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2962
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?