
Conversation


@joyhaldar joyhaldar commented Nov 15, 2025

Summary

This PR adds a file-pruning optimization for NOT IN and != predicates when a file contains a single distinct value (i.e., when min == max).

Problem

Currently, InclusiveMetricsEvaluator cannot prune files for NOT IN and != predicates, even when the file provably contains no matching rows.

Solution

When min == max and the file has no nulls, we can safely prune if:

  • For NOT IN: the single value is in the exclusion list
  • For !=: the single value equals the literal

This works for both direct columns and transformed columns (e.g., years(), truncate()).
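To make the condition concrete, here is a minimal Java sketch of the decision. This is illustrative only and not the actual InclusiveMetricsEvaluator code, which works on serialized bounds and null-count maps; the names here are hypothetical, and NaN handling (discussed later in the review) is omitted.

import java.util.Comparator;
import java.util.Set;

class SingleValuePruningSketch {
  // != : if every row holds the literal value, no row can satisfy value != literal.
  static <T> boolean canPruneNotEq(T lower, T upper, T literal, long nullCount, Comparator<T> cmp) {
    return nullCount == 0 && lower.equals(upper) && cmp.compare(lower, literal) == 0;
  }

  // NOT IN : if every row holds a value in the exclusion set, no row can match.
  static <T> boolean canPruneNotIn(T lower, T upper, Set<T> exclusions, long nullCount) {
    return nullCount == 0 && lower.equals(upper) && exclusions.contains(lower);
  }
}

For a transformed column such as years(ts), the same check applies to the transformed bounds: if the transformed min and max are equal and that single value is excluded, the file can be skipped.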

Testing

  • Added unit tests for both notIn and notEq optimizations
  • Verified correct behavior with nulls (must scan) and without nulls (can prune)
  • Updated Spark test expectations where optimization now prunes additional files

Fixes #14592

@github-actions github-actions bot added the API label Nov 15, 2025
@joyhaldar joyhaldar changed the title Optimize NOT IN and != predicates for single-value files [WIP] - Optimize NOT IN and != predicates for single-value files Nov 17, 2025
The notEq and notIn optimizations now correctly prune files where the transformed
min and max values are equal. Updated test expectations from 10 to 5 partitions for:
- testUnpartitionedYears (years transform)
- testUnpartitionedOr (years transform in OR clause)
- testUnpartitionedTruncateString (truncate transform)
@github-actions github-actions bot added the spark label Nov 17, 2025
@joyhaldar joyhaldar changed the title [WIP] - Optimize NOT IN and != predicates for single-value files Optimize NOT IN and != predicates for single-value files Nov 17, 2025
// However, when min == max and the file has no nulls, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
@nandorKollar nandorKollar (Contributor) commented Nov 17, 2025

How should this behave if there are NaNs in the data file? What happens if there are only NaNs, so both the upper and lower bounds are NaN?

@joyhaldar (Author)

Thank you for your review comment, @nandorKollar.

You're absolutely right. I've added NaN handling for both cases:

  • Files with NaN values in the data
  • Files with NaN bounds

Could you please take another look at the PR?

@nandorKollar (Contributor)

Thanks @joyhaldar, I'll review it soon. Meanwhile, I found that we probably shouldn't worry too much about NaNs in the lower and upper bounds; the spec states:
For float and double, the value -0.0 must precede +0.0, as in the IEEE 754 totalOrder predicate. NaNs are not permitted as lower or upper bounds.
Though looking at the tests, it seems that TestInclusiveMetricsEvaluator still tests against NaN in the lower/upper bounds; maybe the V1 spec still permitted this case?

cc @pvary, what do you think about this improvement? Does it look promising to you?

@joyhaldar joyhaldar (Author) commented Nov 20, 2025

Thank you again Nandor. You're right that the spec says NaNs are not permitted as lower or upper bounds.

However, there's a note in the class javadoc that explains why we may still need to check:

Due to the comparison implementation of ORC stats, for float/double columns in ORC files, if the first value in a file is NaN, metrics of this file will report NaN for both upper and lower bound despite that the column could contain non-NaN data. Thus in some scenarios explicitly checks for NaN is necessary in order to not skip files that may contain matching data.

So the spec prohibits NaN bounds, but based on the javadoc, ORC files in the wild can still have them. I can also see that existing methods like lt(), ltEq(), and eq() all include the NaNUtil.isNaN(bounds) check.

Let me know if you see it differently and if we should consider removing this check.

@nandorKollar (Contributor)

Thanks, this looks like a good reason to pay attention to NaN here too!

Check for NaN values and NaN bounds before applying single-value
optimization to avoid incorrectly pruning files with NaN data.
@joyhaldar joyhaldar changed the title Optimize NOT IN and != predicates for single-value files WIP: Optimize NOT IN and != predicates for single-value files Nov 18, 2025
@joyhaldar joyhaldar changed the title WIP: Optimize NOT IN and != predicates for single-value files Optimize NOT IN and != predicates for single-value files Nov 18, 2025
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
@nandorKollar (Contributor)

How about including mayContainNaN(id) in this branch:

if (mayContainNull(id) || mayContainNaN(id)) {
   return ROWS_MIGHT_MATCH;
}

and leave out the other branch
if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {

@joyhaldar joyhaldar (Author) commented Nov 20, 2025

Thank you for the suggestion Nandor. I actually tried this initially but had to change it due to test failures.

The problem I faced is that mayContainNaN(id), which would be defined as nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0, returns true when nanCounts == null or when the column has no entry in the map.

The tests that failed with mayContainNaN(id) use timestamp/string columns, and mayContainNaN() returns true for them (either because nanCounts == null or the column isn't in the map), preventing the optimization from running.

The current approach checks NaN in two ways (see the sketch below):

  1. NaNUtil.isNaN(bounds) - returns false for timestamps/strings (they can't be NaN)
  2. nanCounts.get(id) != 0 - only checks if stats actually exist
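To illustrate how the two guards combine, here is a small self-contained sketch. It is not the PR's code; the names are hypothetical, and the real implementation uses Iceberg's NaNUtil.isNaN on the deserialized bounds together with the file's nanCounts map.

import java.util.Map;

class NaNGuardSketch {
  // Stand-in for NaNUtil.isNaN: only float/double values can be NaN, so
  // timestamp/string bounds never trip this check.
  static boolean isNaN(Object value) {
    return (value instanceof Double && ((Double) value).isNaN())
        || (value instanceof Float && ((Float) value).isNaN());
  }

  // Guard 1: bounds reported as NaN (possible with ORC stats) => must scan.
  // Guard 2: nanCounts has a non-zero entry for the column => must scan.
  static boolean mustScanForNaN(Object lower, Object upper, Map<Integer, Long> nanCounts, int id) {
    if (isNaN(lower) || isNaN(upper)) {
      return true;
    }
    return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0;
  }
}

With these guards, a timestamp or string column falls through both checks (its bounds can never be NaN and its nanCounts entry is zero or absent), so the single-value optimization still applies to it.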

}

if (lower.equals(upper)) {
int cmp = lit.comparator().compare(lower, lit.value());
@manuzhang (Member)

we don't need another local variable here.

@joyhaldar (Author)

Thank you for the review Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods lt(), gt(), ltEq(), gtEq(), eq(), and in() all declare similar local variables, including this one:
int cmp = lit.comparator().compare(lower, lit.value());

I thought extracting it would be inconsistent with how other methods are structured.

However, if you feel this is important for maintainability, I'm happy to remove the local variable declaration. Let me know what you think.

// them. notIn(col, {X, ...}) with (X, Y) doesn't guarantee that X is a value in col.
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value is in the exclusion set.
int id = term.ref().fieldId();
@manuzhang (Member)

We can put the similar logic into a separate method.

@joyhaldar (Author)

Thank you for the suggestion Manu. I was trying to keep the logic inline to match the existing code style in this file.

For example, the methods eq(), in(), lt(), ltEq(), gt(), and gtEq() all have similar inline checks.

I thought extracting it would be inconsistent with how other methods are structured.

However, if you feel this is important for maintainability, I'm happy to extract it. Let me know your preference.

@manuzhang (Member)

I prefer to extract similar complex logic into a separate method.

@joyhaldar (Author)

OK, I can do that. I'll make the changes and request another review from you.
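As a rough idea of what the extraction could look like, here is a hypothetical standalone version of such a helper. Inside the evaluator it would instead reuse the existing mayContainNull/NaN checks and the ROWS_MIGHT_MATCH/ROWS_CANNOT_MATCH constants; the names below are not the final PR code.

import java.util.function.Predicate;

class SingleValueHelperSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // Returns CANNOT_MATCH only when the file holds exactly one non-null,
  // non-NaN value and that value is excluded by the negative predicate.
  static <T> boolean evalSingleValue(T lower, T upper, boolean mayContainNullOrNaN, Predicate<T> excludes) {
    if (mayContainNullOrNaN || lower == null || !lower.equals(upper)) {
      return ROWS_MIGHT_MATCH;
    }
    return excludes.test(lower) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
  }
}

notEq() could then pass v -> lit.comparator().compare(v, lit.value()) == 0, and notIn() could pass the literal set's contains method, so both predicates share the same single-value logic.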

@manuzhang manuzhang changed the title Optimize NOT IN and != predicates for single-value files API, Spark: Optimize NOT IN and != predicates for single-value files Nov 19, 2025

Development

Successfully merging this pull request may close these issues.

NOT IN and != predicates do not prune files when min == max
