
Conversation


@joyhaldar joyhaldar commented Nov 15, 2025

Summary

This PR adds a file-pruning optimization for NOT IN and != predicates when a file contains a single distinct value (i.e., when min == max).

Problem

Currently, InclusiveMetricsEvaluator cannot prune files for NOT IN and != predicates, even when the file provably contains no matching rows.

Solution

When min == max and the file has no nulls, we can safely prune if:

  • For NOT IN: the single value is in the exclusion list
  • For !=: the single value equals the literal

This works for both direct columns and transformed columns (e.g., years(), truncate()).
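To make the condition concrete, here is a minimal Java sketch of the decision. This is illustrative only and not the actual InclusiveMetricsEvaluator code, which works on serialized bounds and null-count maps; the names here are hypothetical, and NaN handling (discussed later in the review) is omitted.

import java.util.Comparator;
import java.util.Set;

class SingleValuePruningSketch {
  // != : if every row holds the literal value, no row can satisfy value != literal.
  static <T> boolean canPruneNotEq(T lower, T upper, T literal, long nullCount, Comparator<T> cmp) {
    return nullCount == 0 && lower.equals(upper) && cmp.compare(lower, literal) == 0;
  }

  // NOT IN : if every row holds a value in the exclusion set, no row can match.
  static <T> boolean canPruneNotIn(T lower, T upper, Set<T> exclusions, long nullCount) {
    return nullCount == 0 && lower.equals(upper) && exclusions.contains(lower);
  }
}

For a transformed column such as years(ts), the same check applies to the transformed bounds: if the transformed min and max are equal and that single value is excluded, the file can be skipped.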

Testing

  • Added unit tests for both notIn and notEq optimizations
  • Verified correct behavior with nulls (must scan) and without nulls (can prune)
  • Updated Spark test expectations where optimization now prunes additional files

Fixes #14592

@github-actions github-actions bot added the API label Nov 15, 2025
@joyhaldar joyhaldar changed the title Optimize NOT IN and != predicates for single-value files [WIP] - Optimize NOT IN and != predicates for single-value files Nov 17, 2025
The notEq and notIn optimizations now correctly prune files where the transformed
min and max values are equal. Updated test expectations from 10 to 5 partitions for:
- testUnpartitionedYears (years transform)
- testUnpartitionedOr (years transform in OR clause)
- testUnpartitionedTruncateString (truncate transform)
@github-actions github-actions bot added the spark label Nov 17, 2025
@joyhaldar joyhaldar changed the title [WIP] - Optimize NOT IN and != predicates for single-value files Optimize NOT IN and != predicates for single-value files Nov 17, 2025
// However, when min == max and the file has no nulls, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
@nandorKollar nandorKollar (Contributor) commented Nov 17, 2025

How should this behave if there are NaNs in the data file? What happens if there are only NaNs, so both the upper and lower bounds are NaN?

@joyhaldar (Author)

Thank you for your review comment, @nandorKollar.

You're absolutely right. I've added NaN handling for both cases:

  • Files with NaN values in the data
  • Files with NaN bounds

Could you please take another look at the PR?

@nandorKollar (Contributor)

Thanks @joyhaldar, I'll review it soon. Meanwhile, I found that we probably shouldn't worry too much about NaNs in the lower and upper bounds; the spec states:
For float and double, the value -0.0 must precede +0.0, as in the IEEE 754 totalOrder predicate. NaNs are not permitted as lower or upper bounds.
Though looking at the tests, it seems that TestInclusiveMetricsEvaluator still tests against NaN in the lower/upper bounds; maybe the V1 spec still permitted this case?

cc @pvary, what do you think about this improvement? Does it look promising to you?

@joyhaldar joyhaldar (Author) commented Nov 20, 2025

Thank you again Nandor. You're right that the spec says NaNs are not permitted as lower or upper bounds.

However, there's a note in the class javadoc that explains why we may still need to check:

Due to the comparison implementation of ORC stats, for float/double columns in ORC files, if the first value in a file is NaN, metrics of this file will report NaN for both upper and lower bound despite that the column could contain non-NaN data. Thus in some scenarios explicitly checks for NaN is necessary in order to not skip files that may contain matching data.

So the spec prohibits NaN bounds, but based on the javadoc, ORC files in the wild can still have them. I can also see that existing methods like lt(), ltEq(), and eq() all include the NaNUtil.isNaN(bounds) check.

Let me know if you see it differently and if we should consider removing this check.

@nandorKollar (Contributor)

Thanks, this looks like a good reason to pay attention to NaN here too!

Check for NaN values and NaN bounds before applying single-value
optimization to avoid incorrectly pruning files with NaN data.
@joyhaldar joyhaldar changed the title Optimize NOT IN and != predicates for single-value files WIP: Optimize NOT IN and != predicates for single-value files Nov 18, 2025
@joyhaldar joyhaldar changed the title WIP: Optimize NOT IN and != predicates for single-value files Optimize NOT IN and != predicates for single-value files Nov 18, 2025
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
@nandorKollar (Contributor)

How about including mayContainNaN(id) in this branch:

if (mayContainNull(id) || mayContainNaN(id)) {
   return ROWS_MIGHT_MATCH;
}

and leave out the other branch
if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {

@joyhaldar joyhaldar (Author) commented Nov 20, 2025

Thank you for the suggestion Nandor. I actually tried this initially but had to change it due to test failures.

The problem I faced is that mayContainNaN(id), which would be defined as nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0, returns true when nanCounts == null or when the column has no entry in the map.

The tests that failed with mayContainNaN(id) use timestamp/string columns, and mayContainNaN() returns true for them (either because nanCounts == null or the column isn't in the map), preventing the optimization from running.

The current approach checks NaN in two ways (see the sketch below):

  1. NaNUtil.isNaN(bounds) - returns false for timestamps/strings (they can't be NaN)
  2. nanCounts.get(id) != 0 - only checks if stats actually exist
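To illustrate how the two guards combine, here is a small self-contained sketch. It is not the PR's code; the names are hypothetical, and the real implementation uses Iceberg's NaNUtil.isNaN on the deserialized bounds together with the file's nanCounts map.

import java.util.Map;

class NaNGuardSketch {
  // Stand-in for NaNUtil.isNaN: only float/double values can be NaN, so
  // timestamp/string bounds never trip this check.
  static boolean isNaN(Object value) {
    return (value instanceof Double && ((Double) value).isNaN())
        || (value instanceof Float && ((Float) value).isNaN());
  }

  // Guard 1: bounds reported as NaN (possible with ORC stats) => must scan.
  // Guard 2: nanCounts has a non-zero entry for the column => must scan.
  static boolean mustScanForNaN(Object lower, Object upper, Map<Integer, Long> nanCounts, int id) {
    if (isNaN(lower) || isNaN(upper)) {
      return true;
    }
    return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0;
  }
}

With these guards, a timestamp or string column falls through both checks (its bounds can never be NaN and its nanCounts entry is zero or absent), so the single-value optimization still applies to it.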

}

if (lower.equals(upper)) {
int cmp = lit.comparator().compare(lower, lit.value());
@manuzhang (Member)

we don't need another local variable here.

@joyhaldar (Author)

Thank you for the review Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods lt(), gt(), ltEq(), gtEq(), eq(), and in() all declare similar local variables, including this one:
int cmp = lit.comparator().compare(lower, lit.value());

I thought extracting it would be inconsistent with how other methods are structured.

However, if you feel this is important for maintainability, I'm happy to remove the local variable declaration. Let me know what you think.

// them. notIn(col, {X, ...}) with (X, Y) doesn't guarantee that X is a value in col.
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value is in the exclusion set.
int id = term.ref().fieldId();
@manuzhang (Member)

We can put the similar logic into a separate method.

@joyhaldar (Author)

Thank you for the suggestion Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods eq(), in(), lt(), ltEq(), gt(), and gtEq() all have similar inline checks.

I thought extracting it would be inconsistent with how other methods are structured.

However, if you feel this is important for maintainability, I'm happy to extract it. Let me know your preference.

@manuzhang (Member)

I prefer to extract similar complex logic into a separate method.

@joyhaldar (Author)

OK, I can do that. I'll make the changes and request another review from you.
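As a rough idea of what the extraction could look like, here is a hypothetical standalone version of such a helper. Inside the evaluator it would instead reuse the existing mayContainNull/NaN checks and the ROWS_MIGHT_MATCH/ROWS_CANNOT_MATCH constants; the names below are not the final PR code.

import java.util.function.Predicate;

class SingleValueHelperSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // Returns CANNOT_MATCH only when the file holds exactly one non-null,
  // non-NaN value and that value is excluded by the negative predicate.
  static <T> boolean evalSingleValue(T lower, T upper, boolean mayContainNullOrNaN, Predicate<T> excludes) {
    if (mayContainNullOrNaN || lower == null || !lower.equals(upper)) {
      return ROWS_MIGHT_MATCH;
    }
    return excludes.test(lower) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
  }
}

notEq() could then pass v -> lit.comparator().compare(v, lit.value()) == 0, and notIn() could pass the literal set's contains method, so both predicates share the same single-value logic.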

@manuzhang manuzhang changed the title Optimize NOT IN and != predicates for single-value files API, Spark: Optimize NOT IN and != predicates for single-value files Nov 19, 2025

Development

Successfully merging this pull request may close these issues.

NOT IN and != predicates do not prune files when min == max
