Pruning of floating point Parquet columns is incorrect when `NaN` is present #15812

etseidl · 2025-04-22T15:01:05Z

Describe the bug

This was mentioned in #15742 (comment) and discussed in detail in apache/parquet-format#221, but datafusion is over-aggressive in pruning floating point columns. The issue appears with predicates of the form x [gt|lt] literal. Consider a column consisting of [1.0, 0.0, -1.0, NaN, -2.0], the max will be 1 and the min -2. A query like select * from ... where x > 2 will return no rows because no chunk exists where max > 2.

To Reproduce

> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched.

Expected behavior

The above query should return a single row containing NaN.

Additional context

The Parquet community is considering changes to allow for NaN in statistics, with the currently favored approach being adding a new ColumnOrder to the specification. This will correct the issue above, but datafusion will need to check the ColumnOrder to know whether or not floating point statistics can be trusted.

Also note that if/when apache/parquet-format#221 is merged, other predicates such as isnan(x) might be candidates for pruning, but that is an optimization.

The text was updated successfully, but these errors were encountered:

etseidl · 2025-04-22T15:03:59Z

cc @adriangb since you've been working in the predicate code recently and may have ideas how to tip off the pruning predicate generation code when floating point pruning is not safe.

adriangb · 2025-04-22T16:40:53Z

I'm not immediately sure. Is the point that the result of max(2.0, NaN) depends on how you define the ordering of floating point numbers wrt NaN, which has two variations?

adriangb · 2025-04-22T16:42:15Z

If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct.

How do we handle this with nulls 🤔

etseidl · 2025-04-22T16:59:27Z

I'm not immediately sure. Is the point that the result of max(2.0, NaN) depends on how you define the ordering of floating point numbers wrt NaN, which has two variations?

Yes. Different systems treat NaN differently, but IIUC datafusion uses total order for floating point comparison, so (loosely) -NaN < -Inf < -x < -0 < 0 < x < Inf < NaN. This ordering is being proposed for Parquet as well.

If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct.

Yes, and I believe that's what parquet-java might already do. But many writers do write stats in this case, which leads to the usual backwards compatibility issues. So in my mind the ultimate solution is check ColumnOrder for the predicate column. If it uses the new IEEE_754_TOTAL_ORDER ordering, then proceed as usual. If it uses TYPE_DEFINED_ORDER, then we know NaN may be present but not accounted for, so don't do any pruning. The problem in my head is how to get the ColumnOrder info down into the plan generation code...or do we do extra plan rewriting at the parquet layer? I'm too new with datafusion to have a feel for where the correct place to handle this is.

How do we handle this with nulls 🤔

Different can of worms 😄. I'm not sure what parquet-rs does with a column of NaN mixed with null. Guess I'll go see...

adriangb · 2025-04-22T17:08:00Z

Where is ColumnOrder available? In theory we have access to the full metadata of the parquet file when building the pruning predicate batches / predicate.

etseidl · 2025-04-22T17:36:49Z

Where is ColumnOrder available? In theory we have access to the full metadata of the parquet file when building the pruning predicate batches / predicate.

It's an array in the FileMetaData, so defined once for the entire file.

I've been trying to trace where the plans get built from, and it seems like there are two paths...one from ParquetSource::with_predicate, which seems to happen before the file is opened, and then the path from ParquetOpener::open where we do have access to the file metadata.

adriangb · 2025-04-22T17:43:54Z

Long term I think it will only happen in ParquetOpener::open #15769 precisely for reasons like this 😄

adriangb · 2025-04-23T04:21:13Z

@etseidl I opened #15821 to prove we can pipe the info in. Would you like to continue that PR since you seem to know what the right thing to do with that information is?

etseidl · 2025-04-23T04:48:54Z

Wow, thanks @adriangb! I'll start on it tomorrow!

etseidl added the bug Something isn't working label Apr 22, 2025

This was referenced Apr 22, 2025

Implement Parquet filter pushdown via new filter pushdown APIs #15769

Open

pipe column orderings into pruning predicate creation #15821

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pruning of floating point Parquet columns is incorrect when `NaN` is present #15812

Pruning of floating point Parquet columns is incorrect when `NaN` is present #15812

etseidl commented Apr 22, 2025

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

adriangb commented Apr 22, 2025

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

adriangb commented Apr 23, 2025

etseidl commented Apr 23, 2025

Pruning of floating point Parquet columns is incorrect when NaN is present #15812

Pruning of floating point Parquet columns is incorrect when NaN is present #15812

Comments

etseidl commented Apr 22, 2025

Describe the bug

To Reproduce

Expected behavior

Additional context

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

adriangb commented Apr 22, 2025

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

etseidl commented Apr 22, 2025

adriangb commented Apr 22, 2025

adriangb commented Apr 23, 2025

etseidl commented Apr 23, 2025

Pruning of floating point Parquet columns is incorrect when `NaN` is present #15812

Pruning of floating point Parquet columns is incorrect when `NaN` is present #15812