Ensure that math functions fulfil the ColumnarValue contract #12922

joroKr21 · 2024-10-14T15:25:47Z

If all UDF arguments are scalars, so should be the result. In most cases, such function calls will be constant-folded. However, if for whatever reason they are not optimized, we want to avoid an error due to array length mismatch.

Which issue does this PR close?

No particular issue. I couldn't find a case that is not constant-folded with pure DataFusion; however, we have some custom optimizer rules that might result in expressions that are not constant-folded.

Rationale for this change

The rationale is that the UDF implementations shouldn't rely on optimizer rules to work correctly.

What changes are included in this PR?

Added a new helper function ColumnarValue::from_args_and_result to perform the check and conversion.
Updated the math UDF macros to return a scalar value when all their arguments are scalar.
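
The two changes above can be sketched in a self-contained form. This is only a simplified model, not the DataFusion code: `Value` is a hypothetical stand-in for `ColumnarValue` that holds `f64` only, and `invoke_unary` is a hypothetical stand-in for what the math UDF macros generate.

```rust
/// Hypothetical, simplified stand-in for `ColumnarValue` (f64 only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Scalar(f64),
    Array(Vec<f64>),
}

/// Models the helper described above: a one-element result becomes a
/// scalar only when every argument was itself a scalar.
pub fn from_args_and_result(args: &[Value], result: Vec<f64>) -> Value {
    let all_scalar = args.iter().all(|arg| matches!(arg, Value::Scalar(_)));
    if result.len() == 1 && all_scalar {
        Value::Scalar(result[0])
    } else {
        Value::Array(result)
    }
}

/// Models a unary math UDF: compute on an array view of the argument,
/// then restore the scalar shape if the input was a scalar.
pub fn invoke_unary(arg: &Value, f: fn(f64) -> f64) -> Value {
    let input: Vec<f64> = match arg {
        Value::Scalar(x) => vec![*x],
        Value::Array(values) => values.clone(),
    };
    let result: Vec<f64> = input.iter().map(|x| f(*x)).collect();
    from_args_and_result(std::slice::from_ref(arg), result)
}
```

In this model, `invoke_unary(&Value::Scalar(4.0), f64::sqrt)` yields a scalar, while a one-element array argument still yields a one-element array, because the check looks at the argument shapes and not only at the result length.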

Are these changes tested?

Relying on existing tests to ensure nothing breaks.

Are there any user-facing changes?

I made ColumnarValue::from_args_and_result public but if that's not desirable, I can move it to the math module.

alamb

❤️

alamb · 2024-10-14T17:22:08Z

datafusion/functions/src/macros.rs

-                fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
-                    let args = ColumnarValue::values_to_arrays(args)?;
-
+                fn invoke(&self, col_args: &[ColumnarValue]) -> Result<ColumnarValue> {


Note that @simonvandel and @jonahgao have been removing use of make_math_unary_udf in general

(for example #12889)

We should probably make sure we aren't missing a similar problem in their reviews

Is it necessary to add an attribute like maintains_scalar to ScalarUDFImpl and then automatically enforce this process during evaluation?

I guess it's not necessarily true for volatile functions. But I wonder then how will the function know the array length? E.g. for a rand(max) function if I call it like rand(42) which is pretty reasonable, how does it work?

If a function needs to maintain scalar, we can call ColumnarValue::from_args_and_result after invoking it.

    // in file: datafusion/physical-expr/src/scalar_function.rs
    // evaluate the function
    let mut output = match self.args.is_empty() {
        true => self.fun.invoke_no_args(batch.num_rows()),
        false => self.fun.invoke(&inputs),
    }?;
    // need_maintain_scalar() is the proposed attribute; it does not exist yet
    if let ColumnarValue::Array(array) = &output {
        if self.fun.need_maintain_scalar() && !inputs.is_empty() {
            output = ColumnarValue::from_args_and_result(&inputs, Arc::clone(array))?;
        }
    }

I mean that if the function is volatile but it has only scalar arguments, how does it know how many rows to produce?

Yeah anyway, I was just trying to show that this is a deficiency of the UDF model. Ideally, invoke should also get the number of expected rows and then there would also be no need for a special invoke_no_args.
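
To make the suggestion concrete, here is a rough sketch of an invoke that also receives the expected number of rows. Everything in it is hypothetical and not part of DataFusion as discussed here: the `Value` enum, the `RowAwareUdf` trait, its `invoke_with_rows` method, and the `ToyRand` function, which uses a fixed-seed generator so that the example is deterministic.

```rust
/// Hypothetical, simplified stand-in for `ColumnarValue` (u64 only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Scalar(u64),
    Array(Vec<u64>),
}

/// Hypothetical trait: a single entry point that always knows the row count,
/// so no separate zero-argument entry point is needed.
pub trait RowAwareUdf {
    fn invoke_with_rows(&self, args: &[Value], num_rows: usize) -> Value;
}

/// Toy volatile function in the spirit of `rand(max)`.
pub struct ToyRand {
    pub seed: u64,
}

impl RowAwareUdf for ToyRand {
    fn invoke_with_rows(&self, args: &[Value], num_rows: usize) -> Value {
        // Expand the optional `max` argument to one bound per row.
        let bounds: Vec<u64> = match args.first() {
            Some(Value::Scalar(max)) => vec![*max; num_rows],
            Some(Value::Array(maxes)) => maxes.clone(),
            None => vec![u64::MAX; num_rows],
        };
        // Linear congruential generator: a fresh value for every row.
        let mut state = self.seed;
        let out: Vec<u64> = bounds
            .iter()
            .map(|max| {
                state = state
                    .wrapping_mul(6364136223846793005)
                    .wrapping_add(1442695040888963407);
                (state >> 33) % (*max).max(1)
            })
            .collect();
        Value::Array(out)
    }
}
```

With this shape, a call like `rand(42)` on a five-row batch can produce five values even though its only argument is a scalar, and the no-argument case goes through the same method.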

Regarding adding a flag with default false, that doesn't sound very useful. Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.

Ideally, invoke should also get the number of expected rows and then there would also be no need for a special invoke_no_args.

Makes sense to me. This makes it easy to support rand_int. It seems that currently all functions should return a scalar value when all their arguments are scalar, for example, math functions that do not use make_math_*_udf.

Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.

This can be a general solution. The disadvantage is that if the arguments are not real scalars, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.

@joroKr21 @alamb Filed a new issue for other functions #12959

This can be a general solution. The disadvantage is that if the arguments are not real scalars, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.

Yeah, but it sounds very unlikely that we will have a batch of one row. In any case, we can do the same check that all arguments are scalar 👍

Yeah, but it sounds very unlikely that we will have a batch of one row.

it's unlikely, but you must not assume that.

a trivial example is a source table with one row.

we can do the same check that all arguments are scalar 👍

👍

findepi · 2024-10-15T14:47:15Z

datafusion/expr-common/src/columnar_value.rs

+    /// Converts an [`ArrayRef`] to a [`ColumnarValue`] based on the supplied arguments.
+    /// This is useful for scalar UDF implementations to fulfil their contract:
+    /// if all arguments are scalar values, the result should also be a scalar value.
+    pub fn from_args_and_result(args: &[Self], result: ArrayRef) -> Result<Self> {
+        if result.len() == 1 && args.iter().all(|arg| matches!(arg, Self::Scalar(_))) {
+            Ok(Self::Scalar(ScalarValue::try_from_array(&result, 0)?))
+        } else {
+            Ok(Self::Array(result))


If we allow such code to exist, wouldn't it be better to just have it during constant-folding, rather than inside function implementations?

It already exists in constant folding, but we shouldn't rely on constant folding for the correct function implementation.

That depends how we define "correct".
IF constant folding is the only party where maintaining ColumnarValue::Scalar matters, THEN we can redefine "correct", loosening the contract.

It's not the only place where it matters. If for some reason the expression is not constant folded, you will get an error that the array lengths of different columns don't match.
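
A small model of that failure, under simplifying assumptions: `Value`, `expand`, and `build_batch` are hypothetical stand-ins for a columnar value, for broadcasting it to the batch size, and for assembling the output columns with a length check.

```rust
/// Hypothetical, simplified stand-in for `ColumnarValue` (i64 only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Scalar(i64),
    Array(Vec<i64>),
}

/// A scalar is broadcast to the batch size; an array is used as-is.
pub fn expand(value: &Value, num_rows: usize) -> Vec<i64> {
    match value {
        Value::Scalar(x) => vec![*x; num_rows],
        Value::Array(values) => values.clone(),
    }
}

/// Assembles output columns and rejects columns of differing lengths.
pub fn build_batch(columns: Vec<Vec<i64>>) -> Result<Vec<Vec<i64>>, String> {
    let expected = columns.first().map_or(0, |c| c.len());
    for (i, col) in columns.iter().enumerate() {
        if col.len() != expected {
            return Err(format!(
                "column {} has length {}, expected {}",
                i,
                col.len(),
                expected
            ));
        }
    }
    Ok(columns)
}
```

In this model, a function that was called with scalar arguments but returned `Value::Array(vec![9])` cannot be combined with a three-row column, whereas `Value::Scalar(9)` is broadcast to three rows and the batch is built successfully.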

jonahgao · 2024-10-16T01:58:41Z

Thanks @joroKr21. Let's merge it first, and then consider the similar problem for other scalar functions.

github-actions bot added the functions Changes to functions implementation label Oct 14, 2024

joroKr21 mentioned this pull request Oct 14, 2024

Ensure that math functions fulfil the ColumnarValue contract coralogix/arrow-datafusion#275

Merged

thinkharderdev approved these changes Oct 14, 2024

alamb reviewed Oct 14, 2024

findepi reviewed Oct 15, 2024

jonahgao merged commit 399e840 into apache:main Oct 16, 2024
25 checks passed

jonahgao mentioned this pull request Oct 16, 2024

Ensure that scalar functions fulfil the ColumnarValue contract #12959

Closed

joroKr21 deleted the math-udfs/scalar-value branch October 16, 2024 06:05

joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 16, 2024

Handle one-element array return value in ScalarFunctionExpr

f58bd6e

This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.

joroKr21 mentioned this pull request Oct 16, 2024

Handle one-element array return value in ScalarFunctionExpr #12965

Merged

joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 16, 2024

Handle one-element array return value in ScalarFunctionExpr

51e07d7

This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.

joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 17, 2024

Handle one-element array return value in ScalarFunctionExpr

ee3cfdc

This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.

alamb pushed a commit that referenced this pull request Oct 17, 2024

Handle one-element array return value in ScalarFunctionExpr (#12965)

0ed369e

This was done in #12922 only for math functions. We now generalize this fallback to all scalar UDFs.

joroKr21 mentioned this pull request Oct 17, 2024

Handle one-element array return value in ScalarFunctionExpr coralogix/arrow-datafusion#276

Merged

findepi mentioned this pull request Oct 22, 2024

Fix functions with Volatility::Volatile and parameters #13001

Merged


5 participants
