Pruning of floating point Parquet columns is incorrect when NaN is present
#15812
Comments
cc @adriangb since you've been working in the predicate code recently and may have ideas on how to tip off the pruning predicate generation code when floating point pruning is not safe.
I'm not immediately sure. Is the point that the result of |
If so, the simplest short-term solution would be to not write stats for containers that have NaN. At least results would then be correct. How do we handle this with nulls 🤔
Yes. Different systems treat
Yes, and I believe that's what parquet-java might already do. But many writers do write stats in this case, which leads to the usual backwards compatibility issues. So in my mind the ultimate solution is check
Different can of worms 😄. I'm not sure what parquet-rs does with a column of |
Where is |
It's an array in the I've been trying to trace where the plans get built from, and it seems like there are two paths...one from |
Long term I think it will only happen in |
Wow, thanks @adriangb! I'll start on it tomorrow!
Describe the bug
This was mentioned in #15742 (comment) and discussed in detail in apache/parquet-format#221: DataFusion is over-aggressive in pruning floating point columns. The issue appears with predicates of the form `x [gt|lt] literal`. Consider a column consisting of `[1.0, 0.0, -1.0, NaN, -2.0]`: the max will be 1 and the min -2. A query like `select * from ... where x > 2` will return no rows because no chunk exists where `max > 2`.
To Reproduce
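The following is not the reporter's reproduction but a minimal standalone sketch of the pruning decision described in the report. It assumes that the writer skips NaN when computing min/max and that NaN sorts above every other value (modeled here with `f64::total_cmp`), which is consistent with the expected behavior stated in this issue. The function names are hypothetical and are not DataFusion or parquet-rs APIs.

```rust
use std::cmp::Ordering;

/// Min/max statistics computed while skipping NaN, so NaN never
/// appears in either bound. (Assumed writer behavior.)
fn min_max_ignoring_nan(values: &[f64]) -> Option<(f64, f64)> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .fold(None, |acc, v| match acc {
            None => Some((v, v)),
            Some((lo, hi)) => Some((lo.min(v), hi.max(v))),
        })
}

/// Naive pruning rule for `x > literal`: keep the chunk only if `max > literal`.
fn may_contain_gt(stats: Option<(f64, f64)>, literal: f64) -> bool {
    match stats {
        Some((_min, max)) => max > literal,
        None => true, // no statistics, so the chunk cannot be pruned
    }
}

/// Row-level evaluation of `x > literal`, assuming NaN orders above
/// every other value (total order via `f64::total_cmp`).
fn rows_matching_gt(values: &[f64], literal: f64) -> Vec<f64> {
    values
        .iter()
        .copied()
        .filter(|v| v.total_cmp(&literal) == Ordering::Greater)
        .collect()
}
```

For the column `[1.0, 0.0, -1.0, NaN, -2.0]`, the statistics come out as min -2 and max 1, so `may_contain_gt(stats, 2.0)` is false and the chunk is skipped, while `rows_matching_gt` still finds the single NaN row. That mismatch is the bug.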
Expected behavior
The above query should return a single row containing `NaN`.
Additional context
The Parquet community is considering changes to allow for `NaN` in statistics, with the currently favored approach being adding a new `ColumnOrder` to the specification. This will correct the issue above, but DataFusion will need to check the `ColumnOrder` to know whether or not floating point statistics can be trusted.
Also note that if/when apache/parquet-format#221 is merged, other predicates such as `isnan(x)` might be candidates for pruning, but that is an optimization.
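As a rough illustration of the `ColumnOrder` check mentioned above, the sketch below gates min/max pruning on the column's physical kind and its declared order. Everything here is hypothetical: the enum, the variant name `NanAware` (a placeholder for whatever order the specification eventually adopts), and the function are illustrative assumptions, not types from parquet-rs or DataFusion. The rule is deliberately conservative and disables pruning for every floating point predicate under the existing order, even though only some comparisons are affected.

```rust
/// Hypothetical stand-in for the column order recorded in Parquet metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnOrder {
    /// The existing type-defined order, where NaN may be missing from min/max.
    TypeDefined,
    /// Placeholder name for a proposed order with well-defined NaN handling.
    NanAware,
}

/// Simplified physical kind of a column (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalKind {
    Int,
    Float,
}

/// Conservative rule: trust min/max for pruning unless the column is
/// floating point and its statistics were written under the existing order.
fn can_prune_with_min_max(kind: PhysicalKind, order: ColumnOrder) -> bool {
    !matches!(
        (kind, order),
        (PhysicalKind::Float, ColumnOrder::TypeDefined)
    )
}
```

A real implementation would likely be finer grained, for example by deciding per predicate whether a hidden NaN could change the result.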