Skip to content

Conversation

chenkovsky
Copy link
Contributor

@chenkovsky chenkovsky commented Jul 28, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

implement spark If udf

Are these changes tested?

UT

Are there any user-facing changes?

No

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Jul 28, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me -- thank you for the contribution @chenkovsky

very nicely tested 👏

I had a question but nothing I think would prevent this PR from merging

arg_types.len()
);
}
let Some(target_type) = comparison_coercion_numeric(&arg_types[1], &arg_types[2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do the arguments have to be numeric> It seems like the rules should be:

  1. Argument 0 is boolean
  2. Argument 2 and 3 can be coerced to the same type

I am surprised that we don't also need to check the first argument for boolean as well 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn comparison_coercion_numeric(
comparison_coercion_numeric supports many different types not only numeric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that try_type_union_resolution is a better choice, so I updated the implementation

@alamb alamb added the spark label Aug 19, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chenkovsky looks good to me

cc @shehabgamin and @andygrove

@shehabgamin
Copy link
Contributor

Thanks @chenkovsky looks good to me

cc @shehabgamin and @andygrove

Will review by EOD Friday PT


# Test basic true condition
query T
SELECT if(true, 'yes', 'no');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests that use more complex expressions, such as referencing columns in an input file?

Also, could you add tests that demonstrate that the 2nd expression is only evaluated when the 1st argument evaluates to true? There should be some tests in case.slt that can be repurposed.

By the way, in Comet, we just translate an if condition to a case expression.

impl IfExpr {
    /// Create a new IF expression
    pub fn new(
        if_expr: Arc<dyn PhysicalExpr>,
        true_expr: Arc<dyn PhysicalExpr>,
        false_expr: Arc<dyn PhysicalExpr>,
    ) -> Self {
        Self {
            if_expr: Arc::clone(&if_expr),
            true_expr: Arc::clone(&true_expr),
            false_expr: Arc::clone(&false_expr),
            case_expr: Arc::new(
                CaseExpr::try_new(None, vec![(if_expr, true_expr)], Some(false_expr)).unwrap(),
            ),
        }
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I wasn't clear on my comment above. It could be that the 2nd or 3rd argument expressions could fail if evaluated on certain rows, and we would expect if to provide conditional evaluation. For example:

select if(a==0, 0, b/a) from tbl

if a==0 then we want to avoid evaluating b/a because it would cause a divide by zero error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a test that demonstrates the issue:

query II
SELECT v, IF(v < 0, 10/0, 1) FROM (VALUES (1), (2)) t(v)
----
1 1
2 1

This fails with DataFusion error: Arrow error: Divide by zero error.

The version of this test in cast.slt works correctly:

query II
SELECT v, CASE WHEN v < 0 THEN 10/0 ELSE 1 END FROM (VALUES (1), (2)) t(v)
----
1 1
2 1

Copy link
Contributor

@SparkApplicationMaster SparkApplicationMaster Aug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also check if when(...).otherwise(else_expr) used here:
https://github.com/apache/datafusion/pull/16946/files#diff-8184a681e2d4b84411030426011f4e80cc4a79e2debd39f6290d0159d83a63a5R97
does not eagerly calculate else_expr.
because I faced it calculating else_expr when using when expression: lakehq/sail#648
and the fix was just removing else_expr from when expr and placing it to the last branch of when with true predicate:
https://github.com/lakehq/sail/pull/649/files
also checked this now by adding in datafusion if.slt:

query I
SELECT case when true then 1 / 1 else 1 / 0 end;
----
1

and got this:

1. query failed: DataFusion error: Arrow error: Divide by zero error
[SQL] SELECT case when true then 1 / 1 else 1 / 0 end;

but this:

query I
SELECT case when a then 1 / 1 else 1 / b end FROM (VALUES (false, 1), (true, 0)) t(a, b);
----
1
1

works OK, maybe the problem with example above is only about eager constant evaluation in else_expr, not every expression

Not sure if eager evaluation of else expression is OK in datafusion, but in spark it's definitely NOT OK, and there is a doctest checking this:
https://github.com/apache/spark/blob/326052ec8280d8bf8ee1904504be3b62a72d3d29/python/pyspark/sql/column.py#L1418-L1427
raise_error(literal(str)) is also constant expression, so it evaluates when not intented to be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created another PR #17311

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation does not handle conditional evaluation. I recommend that we delegate to CaseExpr instead, since if(a, b, c) is semantically equivalent to CASE WHEN a THEN b ELSE c END and case.slt testing is already quite comprehensive.

@shehabgamin
Copy link
Contributor

The current implementation does not handle conditional evaluation. I recommend that we delegate to CaseExpr instead, since if(a, b, c) is semantically equivalent to CASE WHEN a THEN b ELSE c END and case.slt testing is already quite comprehensive.

If helpful, this is how Sail has implemented the if function:
https://github.com/lakehq/sail/blob/main/crates/sail-plan/src/function/scalar/conditional.rs#L29-L37

fn if_expr(input: ScalarFunctionInput) -> PlanResult<expr::Expr> {
    let ScalarFunctionInput { arguments, .. } = input;
    let (when_expr, then_expr, else_expr) = arguments.three()?;
    Ok(expr::Expr::Case(expr::Case {
        expr: None,
        when_then_expr: vec![(Box::new(when_expr), Box::new(then_expr))],
        else_expr: Some(Box::new(else_expr)),
    }))
}

);
}

let target_types = try_type_union_resolution(&arg_types[1..])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to modify the second and third args? We can probably just do:

let result = vec![DataType::Boolean, arg_types[1], arg_types[2]];

Also, I can't recall if coerce_types is actually called when simplify is used. Did you happen to test it out by any chance?

Copy link
Contributor Author

@chenkovsky chenkovsky Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query failed: DataFusion error: Optimizer rule 'simplify_expressions' failed
caused by
Error during planning: CASE expression 'then' values had multiple data types: {Float64, Int64}

Copy link
Contributor

@shehabgamin shehabgamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenkovsky Thanks for updating the implementation to use case. Could you also add tests to demonstrate that this is working as intended?

@andygrove andygrove merged commit 99daafd into apache:main Aug 30, 2025
27 checks passed
@Jefffrey Jefffrey mentioned this pull request Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants