API, Spark: Optimize NOT IN and != predicates for single-value files #14593
Conversation
The notEq and notIn optimizations now correctly prune files where the transformed min == max. Updated test expectations from 10 to 5 partitions for:
- testUnpartitionedYears (years transform)
- testUnpartitionedOr (years transform in an OR clause)
- testUnpartitionedTruncateString (truncate transform)
```java
// However, when min == max and the file has no nulls, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
```
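For reference, the pruning rule under discussion can be sketched outside Iceberg with plain longs standing in for the evaluator's typed bound objects. This is an illustrative sketch, not the PR's actual code; `canPrune` and its parameters are hypothetical names.

```java
// Hypothetical sketch of the single-value pruning rule for "col != literal".
// Plain longs stand in for Iceberg's typed bounds; not the PR's actual code.
final class NotEqPruneSketch {

  // A file can be skipped for "col != literal" only when every row holds the
  // same value (min == max), there are no nulls, and that value is the literal.
  static boolean canPrune(long min, long max, long nullCount, long literal) {
    if (nullCount > 0) {
      return false; // stay conservative when nulls are present, as in the PR
    }
    return min == max && min == literal;
  }

  public static void main(String[] args) {
    System.out.println(canPrune(5, 5, 0, 5)); // every row is 5, so no row passes != 5
    System.out.println(canPrune(5, 9, 0, 5)); // a range of values: rows might match
  }
}
```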
How should this behave if there are NaNs in the data file? What happens if there are only NaNs, so that both the upper and lower bound are NaN?
Thank you for your review comment @nandorKollar.
You're absolutely right. I've added NaN handling for both cases:
- Files with NaN values in the data
- Files with NaN bounds
Could you please take another look at the PR?
Thanks @joyhaldar, I'll review it soon. Meanwhile, I found out that we probably shouldn't worry too much about NaNs in the lower and upper bounds; the spec states:

> For float and double, the value -0.0 must precede +0.0, as in the IEEE 754 totalOrder predicate. NaNs are not permitted as lower or upper bounds.

Though looking at the tests, it seems that TestInclusiveMetricsEvaluator still tests against NaN in lower/upper bounds; maybe the V1 spec still permitted this case?
cc. @pvary what do you think about this improvement, does it look promising to you?
Thank you again Nandor. You're right that it says that NaNs are not permitted as lower or upper bounds.
However, there's actually a note in the class javadoc that explains why we still may need to check,
> Due to the comparison implementation of ORC stats, for float/double columns in ORC files, if the first value in a file is NaN, metrics of this file will report NaN for both upper and lower bound despite that the column could contain non-NaN data. Thus in some scenarios explicitly checks for NaN is necessary in order to not skip files that may contain matching data.
So the spec ideally prohibits NaN bounds, but ORC files in the wild can still have them based on the javadoc. I can also see that existing methods like lt(), ltEq(), eq() all include the NaNUtil.isNaN(bounds) check.
Let me know if you see it differently and if we should consider removing this check.
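As a minimal illustration of the guard being discussed, assuming bounds are already decoded to doubles (this is not the Iceberg API; `boundsUsable` is an illustrative name):

```java
// Minimal sketch of the NaN-bound guard: ORC-written float/double stats can
// report NaN for both bounds, in which case the bounds carry no information
// and the single-value optimization must be skipped. Illustrative code only.
final class NanBoundGuard {
  static boolean boundsUsable(double lower, double upper) {
    return !Double.isNaN(lower) && !Double.isNaN(upper);
  }

  public static void main(String[] args) {
    System.out.println(boundsUsable(1.0, 2.0));               // normal stats: usable
    System.out.println(boundsUsable(Double.NaN, Double.NaN)); // ORC NaN-first file: not usable
  }
}
```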
Thanks, this looks like a good reason to pay attention to NaN here too!
Check for NaN values and NaN bounds before applying single-value optimization to avoid incorrectly pruning files with NaN data.
```java
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value equals the literal.
int id = term.ref().fieldId();
if (mayContainNull(id)) {
```
How about including mayContainNaN(id) in this branch:

```java
if (mayContainNull(id) || mayContainNaN(id)) {
  return ROWS_MIGHT_MATCH;
}
```

and leaving out the other branch:

```java
if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {
```
Thank you for the suggestion Nandor. I actually tried this initially but had to change it due to test failures.
The problem I faced is that `mayContainNaN(id)`, which would be defined as `nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0`, returns true when `nanCounts == null` or when the column has no entry in the map.
Tests that failed with mayContainNaN(id):
- testUnpartitionedYears(): expects 5 partitions, gets 10
- testUnpartitionedTruncateString(): expects 5 partitions, gets 10
- testUnpartitionedOr(): expects 5 partitions, gets 10
These tests use timestamp/string columns, and mayContainNaN() returns true for them (either because nanCounts == null or the column isn't in the map), preventing the optimization from running.
The current approach checks NaN in two ways:
- `NaNUtil.isNaN(bounds)` returns `false` for timestamps/strings (they can't be NaN)
- `nanCounts.get(id) != 0` only fires when NaN stats actually exist
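The difference can be demonstrated with a small standalone sketch. The method names mirror the discussion above but are not Iceberg's actual API; the field ids and stats map are made up for illustration.

```java
import java.util.Map;

// Illustrates why a broad mayContainNaN(id) blocks the optimization for
// non-floating-point columns: with no nanCounts entry (timestamps, strings),
// it conservatively returns true, while the narrower check only reacts to
// recorded NaN counts. Names mirror the discussion, not Iceberg's API.
final class NanCheckComparison {

  // Broad check: treats "no stats" the same as "might contain NaN".
  static boolean mayContainNaN(Map<Integer, Long> nanCounts, int id) {
    return nanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;
  }

  // Narrow check: only true when a nonzero NaN count was actually recorded.
  static boolean hasRecordedNaNs(Map<Integer, Long> nanCounts, int id) {
    return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0;
  }

  public static void main(String[] args) {
    Map<Integer, Long> stats = Map.of(1, 0L); // NaN stats only for field 1 (a double)
    int tsField = 2;                          // timestamp column: no NaN stats at all
    System.out.println(mayContainNaN(stats, tsField));   // true  -> blocks pruning
    System.out.println(hasRecordedNaNs(stats, tsField)); // false -> pruning allowed
  }
}
```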
```java
}

if (lower.equals(upper)) {
  int cmp = lit.comparator().compare(lower, lit.value());
```
We don't need another local variable here.
Thank you for the review Manu. I was trying to keep the logic inline to match the existing code style in this file.
For example, the methods lt(), gt(), ltEq(), gtEq(), eq(), and in() all declare similar variables, including this one:

```java
int cmp = lit.comparator().compare(lower, lit.value());
```
I thought extracting it would be inconsistent with how other methods are structured.
However, if you feel this is important for maintainability, I'm happy to remove the local variable declaration. Let me know what you think.
```java
// them. notIn(col, {X, ...}) with (X, Y) doesn't guarantee that X is a value in col.
// However, when min == max and the file has no nulls or NaN values, we can safely prune
// if that value is in the exclusion set.
int id = term.ref().fieldId();
```
We can put the similar logic into a separate method.
Thank you for the suggestion Manu. I was trying to keep the logic inline to match the existing code style in this file.
For example, the methods eq(), in(), lt(), ltEq(), gt(), gtEq(), all have similar inline checks
I thought extracting it would be inconsistent with how other methods are structured.
However, if you feel this is important for maintainability, I'm happy to extract it. Let me know your preference.
I prefer to extract similar complex logic in a separate method.
OK, I can do that. I'll make the changes and request another review from you.
Summary

This PR adds a file-pruning optimization for `NOT IN` and `!=` predicates when a file contains a single distinct value (i.e., when `min == max`).

Problem

Currently, InclusiveMetricsEvaluator cannot prune files for `NOT IN` and `!=` predicates, even when the file provably contains no matching rows.

Solution

When `min == max` and the file has no nulls, we can safely prune if:
- `NOT IN`: the single value is in the exclusion list
- `!=`: the single value equals the literal

This works for both direct columns and transformed columns (e.g., years(), truncate()).
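The NOT IN case can be sketched the same way as the != case, with a set of excluded values in place of a single literal. This is an illustrative sketch under the assumptions above, not the PR's actual code; plain longs stand in for Iceberg's typed bounds.

```java
import java.util.Set;

// Hypothetical sketch of the NOT IN pruning rule: when min == max with no
// nulls, the file can be skipped iff that single value is in the exclusion
// set. Illustrative only; not the actual InclusiveMetricsEvaluator code.
final class NotInPruneSketch {

  static boolean canPrune(long min, long max, long nullCount, Set<Long> exclusions) {
    return nullCount == 0 && min == max && exclusions.contains(min);
  }

  public static void main(String[] args) {
    // e.g. a years()-transformed column where every row falls in year 2020
    System.out.println(canPrune(2020, 2020, 0, Set.of(2020L, 2021L))); // prune
    System.out.println(canPrune(2020, 2022, 0, Set.of(2020L)));        // keep
  }
}
```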
Testing

- Added unit tests covering the `notIn` and `notEq` optimizations

Fixes #14592