-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: clarify behavior of is_null, is_not_null, is_nan, is_finite, and is_infinite for nulls #285
Conversation
5e0801b
to
57d75b5
Compare
extensions/functions_comparison.yaml
Outdated
@@ -62,6 +62,9 @@ scalar_functions: | |||
description: Whether a value is null. | |||
impls: | |||
- args: | |||
- name: nan_is_null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind elaborating on the motivation for adding this option? I'm wondering why an is_nan
function wouldn't suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different systems have different definitions of whether NaN is equivalent to NULL or not.
This is something we encountered when working on the Acero compute functions; https://issues.apache.org/jira/browse/ARROW-12959 and https://issues.apache.org/jira/browse/ARROW-12960.
If we wanted is_null()
to return TRUE
for NaN
values and TRUE
for NULL
, then I don't think mapping it to is_nan()
would achieve this, as that would return FALSE
for NULL
. Or would it return NULL
for NULL
? Either way, I guess having this explicitly as an option removes that ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason producers can't handle this kind of transform?
Most systems treat nan and null as distinct. As far as I know, it's only R and Pandas that support treating these two values equivalently in some contexts but not all.
Ideally we can avoid propagating the mistake of treating these two values as optionally equivalent as part of the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.
extensions/functions_comparison.yaml
Outdated
@@ -78,9 +84,15 @@ scalar_functions: | |||
description: Whether a value is not a number. | |||
impls: | |||
- args: | |||
- name: value_for_null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, what's the motivation to allow NULL to take on different logical meaning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T
for now (borrowing the name from Rust). In that case, this function is just is_nan(x)
, unwrap_or(is_nan(x), true)
, or unwrap_or(is_nan(x), false)
for the three options respectively, with the added benefit that value_for_null
can now be any expression.
extensions/functions_comparison.yaml
Outdated
description: Whether a value is not null. | ||
impls: | ||
- args: | ||
- name: nan_is_null | ||
options: [ TRUE, FALSE ] | ||
required: false | ||
- value: any1 | ||
return: BOOLEAN | ||
nullability: DECLARED_OUTPUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is rather redundant. It's just not(is_null(x))
. What's the reason for having both?
extensions/functions_comparison.yaml
Outdated
@@ -78,9 +84,15 @@ scalar_functions: | |||
description: Whether a value is not a number. | |||
impls: | |||
- args: | |||
- name: value_for_null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming there is some function that assigns a default value to null values, it's also not primitive. I'm too lazy to look up if it does, but if not it probably should; I'll define it as unwrap_or(T?, T) -> T
for now (borrowing the name from Rust). In that case, this function is just is_nan(x)
, unwrap_or(is_nan(x), true)
, or unwrap_or(is_nan(x), false)
for the three options respectively, with the added benefit that value_for_null
can now be any expression.
extensions/functions_comparison.yaml
Outdated
@@ -62,6 +62,9 @@ scalar_functions: | |||
description: Whether a value is null. | |||
impls: | |||
- args: | |||
- name: nan_is_null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @cpcloud. I don't see why these shenanigans couldn't be modeled without having this option. Let's not embed arbitrary and questionable special cases into Substrait core.
OK, cool, thanks for taking a look; will close this PR |
57d75b5
to
e22dd80
Compare
@cpcloud @jvanstraten In a separate conversation, @saulpw suggested that perhaps we could either update the function descriptions to clarify this difference and/or add a note to the types page to clarify it there. Would be good to get your thoughts on this - I'm thinking that if we don't want to add it to the description via my latest changes to this PR (I'm leaning towards not merging this given a lot of DBs don't operate this way anyway), it would still be good to add a few words to the docs to be explicit about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't argue with making sure things are abundantly clear :)
Thanks @thisisnic for doing this and thanks @cpcloud and @jvanstraten for the feedback. When @thisisnic and I initially discussed this idea of adding options for whether or not to treat |
@jvanstraten could you confirm that you would expect the return values of these expressions to be as follows (assuming the input has floating point type)?
This PR makes explicit the first two. Should we also make explicit the last three? |
I think so. Happy to add those in here once we've confirmed those are indeed our expected results. |
Actually I would expect The way I conceptualize a nullable type is as a sum type (aka tagged union) of the value null and the non-nullable base type. NaNs and infinities, on the other hand, are part of float types (thanks IEEE); they are part of the number. So a nullable float isn't a Are they already defined in Substrait or in a PR somewhere? ETA: I would also expect |
Thanks @jvanstraten. I think your rationale for @thisisnic does the above make sense to you too? If so, could you please modify the entries for |
a4d37ec
to
928bf43
Compare
This PR adds the option to control whether calling
is_null
andis_not_null
on anan
value returns true/false, and also adds the option to control is callingis_nan
on a null value returns true/false/null.