Skip to content

Pruning of floating point Parquet columns is incorrect when NaN is present #15812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
etseidl opened this issue Apr 22, 2025 · 9 comments · May be fixed by #15821
Open

Pruning of floating point Parquet columns is incorrect when NaN is present #15812

etseidl opened this issue Apr 22, 2025 · 9 comments · May be fixed by #15821
Labels
bug Something isn't working

Comments

@etseidl
Copy link

etseidl commented Apr 22, 2025

Describe the bug

This was mentioned in #15742 (comment) and discussed in detail in apache/parquet-format#221, but datafusion is over-aggressive in pruning floating point columns. The issue appears with predicates of the form x [gt|lt] literal. Consider a column consisting of [1.0, 0.0, -1.0, NaN, -2.0], the max will be 1 and the min -2. A query like select * from ... where x > 2 will return no rows because no chunk exists where max > 2.

To Reproduce

> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched. 

Expected behavior

The above query should return a single row containing NaN.

Additional context

The Parquet community is considering changes to allow for NaN in statistics, with the currently favored approach being adding a new ColumnOrder to the specification. This will correct the issue above, but datafusion will need to check the ColumnOrder to know whether or not floating point statistics can be trusted.

Also note that if/when apache/parquet-format#221 is merged, other predicates such as isnan(x) might be candidates for pruning, but that is an optimization.

@etseidl etseidl added the bug Something isn't working label Apr 22, 2025
@etseidl
Copy link
Author

etseidl commented Apr 22, 2025

cc @adriangb since you've been working in the predicate code recently and may have ideas how to tip off the pruning predicate generation code when floating point pruning is not safe.

@adriangb
Copy link
Contributor

I'm not immediately sure. Is the point that the result of max(2.0, NaN) depends on how you define the ordering of floating point numbers wrt NaN, which has two variations?

@adriangb
Copy link
Contributor

If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct.

How do we handle this with nulls 🤔

@etseidl
Copy link
Author

etseidl commented Apr 22, 2025

I'm not immediately sure. Is the point that the result of max(2.0, NaN) depends on how you define the ordering of floating point numbers wrt NaN, which has two variations?

Yes. Different systems treat NaN differently, but IIUC datafusion uses total order for floating point comparison, so (loosely) -NaN < -Inf < -x < -0 < 0 < x < Inf < NaN. This ordering is being proposed for Parquet as well.

If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct.

Yes, and I believe that's what parquet-java might already do. But many writers do write stats in this case, which leads to the usual backwards compatibility issues. So in my mind the ultimate solution is check ColumnOrder for the predicate column. If it uses the new IEEE_754_TOTAL_ORDER ordering, then proceed as usual. If it uses TYPE_DEFINED_ORDER, then we know NaN may be present but not accounted for, so don't do any pruning. The problem in my head is how to get the ColumnOrder info down into the plan generation code...or do we do extra plan rewriting at the parquet layer? I'm too new with datafusion to have a feel for where the correct place to handle this is.

How do we handle this with nulls 🤔

Different can of worms 😄. I'm not sure what parquet-rs does with a column of NaN mixed with null. Guess I'll go see...

@adriangb
Copy link
Contributor

Where is ColumnOrder available? In theory we have access to the full metadata of the parquet file when building the pruning predicate batches / predicate.

@etseidl
Copy link
Author

etseidl commented Apr 22, 2025

Where is ColumnOrder available? In theory we have access to the full metadata of the parquet file when building the pruning predicate batches / predicate.

It's an array in the FileMetaData, so defined once for the entire file.

I've been trying to trace where the plans get built from, and it seems like there are two paths...one from ParquetSource::with_predicate, which seems to happen before the file is opened, and then the path from ParquetOpener::open where we do have access to the file metadata.

@adriangb
Copy link
Contributor

Long term I think it will only happen in ParquetOpener::open #15769 precisely for reasons like this 😄

@adriangb
Copy link
Contributor

@etseidl I opened #15821 to prove we can pipe the info in. Would you like to continue that PR since you seem to know what the right thing to do with that information is?

@etseidl
Copy link
Author

etseidl commented Apr 23, 2025

Wow, thanks @adriangb! I'll start on it tomorrow!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants