-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API, Spark: Optimize NOT IN and != predicates for single-value files #14593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
930ceeb
9bae0c6
b828a47
0e37256
59fecea
bee3442
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -327,6 +327,29 @@ public <T> Boolean eq(Bound<T> term, Literal<T> lit) { | |
| public <T> Boolean notEq(Bound<T> term, Literal<T> lit) { | ||
| // because the bounds are not necessarily a min or max value, this cannot be answered using | ||
| // them. notEq(col, X) with (X, Y) doesn't guarantee that X is a value in col. | ||
| // However, when min == max and the file has no nulls or NaN values, we can safely prune | ||
| // if that value equals the literal. | ||
| int id = term.ref().fieldId(); | ||
| if (mayContainNull(id)) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
| T lower = lowerBound(term); | ||
| T upper = upperBound(term); | ||
|
|
||
| if (lower == null || upper == null || NaNUtil.isNaN(lower) || NaNUtil.isNaN(upper)) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| if (lower.equals(upper)) { | ||
| int cmp = lit.comparator().compare(lower, lit.value()); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we don't need another local variable here.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the review Manu. I was trying to keep the logic inline to match the existing code style in this file. For example, the methods I thought extracting it would be inconsistent with how other methods are structured. However, if you feel this is important for maintainability, I'm happy to remove the local variable declaration. Let me know what you think. |
||
| if (cmp == 0) { | ||
| return ROWS_CANNOT_MATCH; | ||
| } | ||
| } | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
|
|
@@ -381,6 +404,28 @@ public <T> Boolean in(Bound<T> term, Set<T> literalSet) { | |
| public <T> Boolean notIn(Bound<T> term, Set<T> literalSet) { | ||
| // because the bounds are not necessarily a min or max value, this cannot be answered using | ||
| // them. notIn(col, {X, ...}) with (X, Y) doesn't guarantee that X is a value in col. | ||
| // However, when min == max and the file has no nulls or NaN values, we can safely prune | ||
| // if that value is in the exclusion set. | ||
| int id = term.ref().fieldId(); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can put the similar logic into a separate method.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the suggestion Manu. I was trying to keep the logic inline to match the existing code style in this file. For example, the methods I thought extracting it would be inconsistent with how other methods are structured. However, if you feel this is important for maintainability, I'm happy to extract it. Let me know your preference.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I prefer to extract similar complex logic in a separate method.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok I can do that, I will make the changes and request you for another review. |
||
| if (mayContainNull(id)) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
| T lower = lowerBound(term); | ||
| T upper = upperBound(term); | ||
|
|
||
| if (lower == null || upper == null || NaNUtil.isNaN(lower) || NaNUtil.isNaN(upper)) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) { | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| if (lower.equals(upper)) { | ||
| if (literalSet.contains(lower)) { | ||
| return ROWS_CANNOT_MATCH; | ||
| } | ||
| } | ||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about including mayContainNaN(id) in this branch:
and leave out the other branch
if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) != 0) {Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion Nandor. I actually tried this initially but had to change it due to test failures.
The problem that I faced is
mayContainNaN(id), which would be defined asnanCounts == null || !nanCounts.containsKey(id) || nanCounts.get(id) != 0;returnstruewhennanCounts == nullor when the column has no entry in the map.Tests that failed with
mayContainNaN(id):These tests use timestamp/string columns, and
mayContainNaN()returnstruefor them (either becausenanCounts == nullor the column isn't in the map), preventing the optimization from running.The current approach checks NaN two ways:
NaNUtil.isNaN(bounds)- returnsfalsefor timestamps/strings (they can't be NaN)nanCounts.get(id) != 0- only checks if stats actually exist