-
Notifications
You must be signed in to change notification settings - Fork 1.7k
feat(spark): implement Spark conditional function if #16946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me -- thank you for the contribution @chenkovsky
very nicely tested 👏
I had a question but nothing I think would prevent this PR from merging
arg_types.len() | ||
); | ||
} | ||
let Some(target_type) = comparison_coercion_numeric(&arg_types[1], &arg_types[2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do the arguments have to be numeric> It seems like the rules should be:
- Argument 0 is boolean
- Argument 2 and 3 can be coerced to the same type
I am surprised that we don't also need to check the first argument for boolean as well 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn comparison_coercion_numeric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that try_type_union_resolution is a better choice, so I updated the implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @chenkovsky looks good to me
cc @shehabgamin and @andygrove
Will review by EOD Friday PT |
|
||
# Test basic true condition | ||
query T | ||
SELECT if(true, 'yes', 'no'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests that use more complex expressions, such as referencing columns in an input file?
Also, could you add tests that demonstrate that the 2nd expression is only evaluated when the 1st argument evaluates to true? There should be some tests in case.slt
that can be repurposed.
By the way, in Comet, we just translate an if
condition to a case
expression.
impl IfExpr {
/// Create a new IF expression
pub fn new(
if_expr: Arc<dyn PhysicalExpr>,
true_expr: Arc<dyn PhysicalExpr>,
false_expr: Arc<dyn PhysicalExpr>,
) -> Self {
Self {
if_expr: Arc::clone(&if_expr),
true_expr: Arc::clone(&true_expr),
false_expr: Arc::clone(&false_expr),
case_expr: Arc::new(
CaseExpr::try_new(None, vec![(if_expr, true_expr)], Some(false_expr)).unwrap(),
),
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I wasn't clear on my comment above. It could be that the 2nd or 3rd argument expressions could fail if evaluated on certain rows, and we would expect if
to provide conditional evaluation. For example:
select if(a==0, 0, b/a) from tbl
if a==0
then we want to avoid evaluating b/a
because it would cause a divide by zero error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a test that demonstrates the issue:
query II
SELECT v, IF(v < 0, 10/0, 1) FROM (VALUES (1), (2)) t(v)
----
1 1
2 1
This fails with DataFusion error: Arrow error: Divide by zero error
.
The version of this test in cast.slt
works correctly:
query II
SELECT v, CASE WHEN v < 0 THEN 10/0 ELSE 1 END FROM (VALUES (1), (2)) t(v)
----
1 1
2 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also check if when(...).otherwise(else_expr
) used here:
https://github.com/apache/datafusion/pull/16946/files#diff-8184a681e2d4b84411030426011f4e80cc4a79e2debd39f6290d0159d83a63a5R97
does not eagerly calculate else_expr
.
because I faced it calculating else_expr
when using when
expression: lakehq/sail#648
and the fix was just removing else_expr from when expr and placing it to the last branch of when with true predicate:
https://github.com/lakehq/sail/pull/649/files
also checked this now by adding in datafusion if.slt:
query I
SELECT case when true then 1 / 1 else 1 / 0 end;
----
1
and got this:
1. query failed: DataFusion error: Arrow error: Divide by zero error
[SQL] SELECT case when true then 1 / 1 else 1 / 0 end;
but this:
query I
SELECT case when a then 1 / 1 else 1 / b end FROM (VALUES (false, 1), (true, 0)) t(a, b);
----
1
1
works OK, maybe the problem with example above is only about eager constant evaluation in else_expr, not every expression
Not sure if eager evaluation of else expression is OK in datafusion, but in spark it's definitely NOT OK, and there is a doctest checking this:
https://github.com/apache/spark/blob/326052ec8280d8bf8ee1904504be3b62a72d3d29/python/pyspark/sql/column.py#L1418-L1427
raise_error(literal(str))
is also constant expression, so it evaluates when not intented to be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created another PR #17311
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation does not handle conditional evaluation. I recommend that we delegate to CaseExpr
instead, since if(a, b, c)
is semantically equivalent to CASE WHEN a THEN b ELSE c END
and case.slt
testing is already quite comprehensive.
If helpful, this is how Sail has implemented the
|
); | ||
} | ||
|
||
let target_types = try_type_union_resolution(&arg_types[1..])?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to modify the second and third args? We can probably just do:
let result = vec![DataType::Boolean, arg_types[1], arg_types[2]];
Also, I can't recall if coerce_types
is actually called when simplify
is used. Did you happen to test it out by any chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query failed: DataFusion error: Optimizer rule 'simplify_expressions' failed
caused by
Error during planning: CASE expression 'then' values had multiple data types: {Float64, Int64}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenkovsky Thanks for updating the implementation to use case
. Could you also add tests to demonstrate that this is working as intended?
Which issue does this PR close?
datafusion-spark
Spark Compatible Functions #15914Rationale for this change
What changes are included in this PR?
implement spark If udf
Are these changes tested?
UT
Are there any user-facing changes?
No