-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet: Add support for row group pruning on FixedSizeBinary #9646
Conversation
acac9ec
to
588c5bc
Compare
588c5bc
to
4740dac
Compare
4740dac
to
496ba9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @progval -- as always this is looking quite good. However, I have a suggestion about data validation / error checking that i think is worth discussing prior to merge
Some(DataType::Decimal128(precision, scale)) => { | ||
Some(ScalarValue::Decimal128( | ||
Some(from_bytes_to_i128(s.$bytes_func())), | ||
*precision, | ||
*scale, | ||
)) | ||
} | ||
Some(DataType::FixedSizeBinary(size)) => { | ||
Some(ScalarValue::FixedSizeBinary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we verify that the resulting bytes have the correct size? As in verify that *size
is the same as (s.$bytes_func().to_vec().len()
?
If not, I suggest we ignore the error (but don't use the statistics) the same way as is done for non UTF8 values i string statistics. Perhaps we can at least debug when this happens with debug!
Since this data can come from any arbitrary parquet writer, I think it is good practice to validate what is in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sure, good catch.
The code for strings is:
let s = std::str::from_utf8(s.$bytes_func())
.map(|s| s.to_string())
.ok();
Some(ScalarValue::Utf8(s))
which looks like it's treated as a NULL, rather than ignored as a statistics value. Should I do the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think converting to NULL
is the correct behavior. In this case that results in ignoring the value in terms of range analysis. More details are here if you are curious:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, done. I also added a log statement for strings, for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @progval
Which issue does this PR close?
Closes #9645.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
It should be invisible other than a performance improvement.