-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: add isNaN and notNaN predicates #1747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
951699a
f33f5fc
80263ac
742d898
d7c3c3e
d5e6663
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,6 +26,7 @@ | |
| import org.apache.iceberg.transforms.Transform; | ||
| import org.apache.iceberg.transforms.Transforms; | ||
| import org.apache.iceberg.types.Types; | ||
| import org.apache.iceberg.util.NaNUtil; | ||
|
|
||
| /** | ||
| * Factory methods for creating {@link Expression expressions}. | ||
|
|
@@ -123,51 +124,79 @@ public static <T> UnboundPredicate<T> notNull(UnboundTerm<T> expr) { | |
| return new UnboundPredicate<>(Expression.Operation.NOT_NULL, expr); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> isNaN(String name) { | ||
| return new UnboundPredicate<>(Expression.Operation.IS_NAN, ref(name)); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> isNaN(UnboundTerm<T> expr) { | ||
| return new UnboundPredicate<>(Expression.Operation.IS_NAN, expr); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> notNaN(String name) { | ||
| return new UnboundPredicate<>(Expression.Operation.NOT_NAN, ref(name)); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) { | ||
| return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr); | ||
| } | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we also need to update the equality predicate to catch
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I originally thought to update Edit: what do you think about doing rewriting
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do not fully understand what you mean by "rewrite logic of
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So now since we want to handle NaN in
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay so it's what I thought, just a bit confused by the notation. So for For
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the quick response! Yeah I think the amount of change to method return type/tests is not a concern now. I just wasn't entirely sure if rewriting
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too.
I think that it would be better to be strict and reject binding in that case because something is clearly wrong. I think a lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted. If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, those are some good points! To make sure I understand correctly/know how to move forward, I have some questions:
I guess the conversation is starting to get too detailed, if you wouldn't mind I'll try to follow up on Slack tomorrow and then post the conclusion here?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes. If the engine generally uses
I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.
Yes. This makes all of the handling in
I'm not convinced either way. You could argue that Feel free to ping me on Slack!
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the explanation! I think now I understand the full picture. I think I've addressed everything except for rewriting in |
||
|
|
||
| public static <T> UnboundPredicate<T> lessThan(String name, T value) { | ||
| validateInput("lessThan", value); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. An easier way to do this is to add the check in It also ensures that we don't add factory methods later and forget to add the check to them.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! I didn't notice Thank you so much for your time reviewing this long PR! |
||
| return new UnboundPredicate<>(Expression.Operation.LT, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> lessThan(UnboundTerm<T> expr, T value) { | ||
| validateInput("lessThan", value); | ||
| return new UnboundPredicate<>(Expression.Operation.LT, expr, value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> lessThanOrEqual(String name, T value) { | ||
| validateInput("lessThanOrEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.LT_EQ, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> lessThanOrEqual(UnboundTerm<T> expr, T value) { | ||
| validateInput("lessThanOrEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.LT_EQ, expr, value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> greaterThan(String name, T value) { | ||
| validateInput("greaterThan", value); | ||
| return new UnboundPredicate<>(Expression.Operation.GT, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> greaterThan(UnboundTerm<T> expr, T value) { | ||
| validateInput("greaterThan", value); | ||
| return new UnboundPredicate<>(Expression.Operation.GT, expr, value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> greaterThanOrEqual(String name, T value) { | ||
| validateInput("greaterThanOrEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.GT_EQ, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> greaterThanOrEqual(UnboundTerm<T> expr, T value) { | ||
| validateInput("greaterThanOrEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.GT_EQ, expr, value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> equal(String name, T value) { | ||
| validateInput("equal", value); | ||
| return new UnboundPredicate<>(Expression.Operation.EQ, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> equal(UnboundTerm<T> expr, T value) { | ||
| validateInput("equal", value); | ||
| return new UnboundPredicate<>(Expression.Operation.EQ, expr, value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> notEqual(String name, T value) { | ||
| validateInput("notEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.NOT_EQ, ref(name), value); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> notEqual(UnboundTerm<T> expr, T value) { | ||
| validateInput("notEqual", value); | ||
| return new UnboundPredicate<>(Expression.Operation.NOT_EQ, expr, value); | ||
| } | ||
|
|
||
|
|
@@ -216,29 +245,43 @@ public static <T> UnboundPredicate<T> notIn(UnboundTerm<T> expr, Iterable<T> val | |
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> predicate(Operation op, String name, T value) { | ||
| validateInput(op.toString(), value); | ||
| return predicate(op, name, Literals.from(value)); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> predicate(Operation op, String name, Literal<T> lit) { | ||
| Preconditions.checkArgument(op != Operation.IS_NULL && op != Operation.NOT_NULL, | ||
| Preconditions.checkArgument( | ||
| op != Operation.IS_NULL && op != Operation.NOT_NULL && op != Operation.IS_NAN && op != Operation.NOT_NAN, | ||
| "Cannot create %s predicate inclusive a value", op); | ||
| return new UnboundPredicate<T>(op, ref(name), lit); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> predicate(Operation op, String name, Iterable<T> values) { | ||
| validateInput(op.toString(), values); | ||
| return predicate(op, ref(name), values); | ||
| } | ||
|
|
||
| public static <T> UnboundPredicate<T> predicate(Operation op, String name) { | ||
| Preconditions.checkArgument(op == Operation.IS_NULL || op == Operation.NOT_NULL, | ||
| Preconditions.checkArgument( | ||
| op == Operation.IS_NULL || op == Operation.NOT_NULL || op == Operation.IS_NAN || op == Operation.NOT_NAN, | ||
| "Cannot create %s predicate without a value", op); | ||
| return new UnboundPredicate<>(op, ref(name)); | ||
| } | ||
|
|
||
| private static <T> UnboundPredicate<T> predicate(Operation op, UnboundTerm<T> expr, Iterable<T> values) { | ||
| validateInput(op.toString(), values); | ||
| return new UnboundPredicate<>(op, expr, values); | ||
| } | ||
|
|
||
| private static <T> void validateInput(String op, T value) { | ||
| Preconditions.checkArgument(!NaNUtil.isNaN(value), String.format("Cannot create %s predicate with NaN", op)); | ||
| } | ||
|
|
||
| private static <T> void validateInput(String op, Iterable<T> values) { | ||
| Preconditions.checkArgument(Lists.newArrayList(values).stream().noneMatch(NaNUtil::isNaN), | ||
| String.format("Cannot create %s predicate with NaN", op)); | ||
| } | ||
|
|
||
| public static True alwaysTrue() { | ||
| return True.INSTANCE; | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -76,6 +76,7 @@ public boolean eval(ContentFile<?> file) { | |
| private class MetricsEvalVisitor extends BoundExpressionVisitor<Boolean> { | ||
| private Map<Integer, Long> valueCounts = null; | ||
| private Map<Integer, Long> nullCounts = null; | ||
| private Map<Integer, Long> nanCounts = null; | ||
| private Map<Integer, ByteBuffer> lowerBounds = null; | ||
| private Map<Integer, ByteBuffer> upperBounds = null; | ||
|
|
||
|
|
@@ -93,6 +94,7 @@ private boolean eval(ContentFile<?> file) { | |
|
|
||
| this.valueCounts = file.valueCounts(); | ||
| this.nullCounts = file.nullValueCounts(); | ||
| this.nanCounts = file.nanValueCounts(); | ||
| this.lowerBounds = file.lowerBounds(); | ||
| this.upperBounds = file.upperBounds(); | ||
|
|
||
|
|
@@ -150,6 +152,34 @@ public <T> Boolean notNull(BoundReference<T> ref) { | |
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| @Override | ||
| public <T> Boolean isNaN(BoundReference<T> ref) { | ||
| Integer id = ref.fieldId(); | ||
|
|
||
| if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) == 0) { | ||
| return ROWS_CANNOT_MATCH; | ||
| } | ||
|
|
||
| // when there's no nanCounts information, but we already know the column only contains null, | ||
| // it's guaranteed that there's no NaN value | ||
| if (containsNullsOnly(id)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we define a similar
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't define But I wasn't sure if we need it for
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that the Then the question reduces to: do we need to consider the case that null value metrics do not exist but NaN metrics do. For now I think the answer is no, because in all metrics modes NaN and null counters either both exist or both not exist.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good, I'll create a
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with the reasoning. If we have NaN counts, then we should have null counts. No need to over-complicated the null logic with a check for when we don't have null counts but do have NaN counts. Good catch! |
||
| return ROWS_CANNOT_MATCH; | ||
| } | ||
|
|
||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| @Override | ||
| public <T> Boolean notNaN(BoundReference<T> ref) { | ||
| Integer id = ref.fieldId(); | ||
|
|
||
| if (containsNaNsOnly(id)) { | ||
| return ROWS_CANNOT_MATCH; | ||
| } | ||
|
|
||
| return ROWS_MIGHT_MATCH; | ||
| } | ||
|
|
||
| @Override | ||
| public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) { | ||
| Integer id = ref.fieldId(); | ||
|
|
@@ -347,5 +377,10 @@ private boolean containsNullsOnly(Integer id) { | |
| nullCounts != null && nullCounts.containsKey(id) && | ||
| valueCounts.get(id) - nullCounts.get(id) == 0; | ||
| } | ||
|
|
||
| private boolean containsNaNsOnly(Integer id) { | ||
| return nanCounts != null && nanCounts.containsKey(id) && | ||
| valueCounts != null && nanCounts.get(id).equals(valueCounts.get(id)); | ||
| } | ||
| } | ||
| } | ||
Uh oh!
There was an error while loading. Please reload this page.