Skip to content

Conversation

@joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Oct 14, 2024

If all UDF arguments are scalars, so should be the result. In most cases, such function calls will be contant-folded, however, if for whatever reason the are not optimized, we want to avoid an error due to array length mismatch.

Which issue does this PR close?

No particular issue, as I couldn't find a case that is not constant-folded with pure DataFusion, however we have some custom optimizer rules that might result in expressions that are not constant-folded.

Rationale for this change

The rationale is that the UDF implementations shouldn't rely on optimizer rules to work correctly.

What changes are included in this PR?

  • Added a new helper function ColumnarValue::from_args_and_result to perform the check and conversion.
  • Updated the math UDF macros to return a scalar value when all their arguments are scalar.

Are these changes tested?

Relying on existing tests to ensure nothing breaks.

Are there any user-facing changes?

I made ColumnarValue::from_args_and_result public but if that's not desirable, I can move it to the math module.

If all UDF arguments are scalars, so should be the result.
In most cases, such function calls will be contant-folded,
however if for whatever reason the are not optimized,
we want to avoid an error due to array length mismatch.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
let args = ColumnarValue::values_to_arrays(args)?;

fn invoke(&self, col_args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that @simonvandel and @jonahgao have been removing use of make_math_unary_udf in general

(or example #12889)

We should probably make sure we aren't missing a similar problem in their reviews

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to add an attribute like maintains_scalar to ScalarUDFImpl and then automatically enforce this process during evaluation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not necessarily true for volatile functions. But I wonder then how will the function know the array length? E.g. for a rand(max) function if I call it like rand(42) which is pretty reasonable, how does it work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a function needs to maintain scalar, we can call ColumnarValue::from_args_and_result after invoking it.

        // in file: datafusion/physical-expr/src/scalar_function.rs
        // evaluate the function
        let output = match self.args.is_empty() {
            true => self.fun.invoke_no_args(batch.num_rows()),
            false => self.fun.invoke(&inputs),
        }?;

        if let ColumnarValue::Array(array) = &output {
            if self.fun.need_maintain_scalar() && !inputs.is_empty() {
                output = ColumnarValue::from_args_and_result(inputs, array)
            }
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that if the function is volatile but it has only scalar arguments, how does it know how many rows to produce?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah anyway, I was just trying to show that this is a deficiency of the UDF model. Ideally, invoke should also get the number of expected rows and then there would also be no need for a special invoke_no_args.

Regarding adding a flag with default false, that doesn't sound very useful. Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.

Copy link
Member

@jonahgao jonahgao Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, invoke should also get the number of expected rows and then there would also be no need for a special invoke_no_args.

Make sense to me. This makes it easy to support rand_int. It seems that currently all functions should return a scalar value when all their arguments are scalar, for example, math functions that do not use make_math_*_udf.

Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.

This can be a general solution. The disadvantage is that if the arguments are not real scalar, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joroKr21 @alamb Filed a new issue for other functions #12959

Copy link
Contributor Author

@joroKr21 joroKr21 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a general solution. The disadvantage is that if the arguments are not real scalar, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.

Yeah, but it sounds very unlikely that we will have a batch of one row. In any case, we can do the same check that all arguments are scalar 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but it sounds very unlikely that we will have a batch of one row.

it's unlikely, but you must not assume that.

a trivial example is a source table with one row.

we can do the same check that all arguments are scalar 👍

👍

Comment on lines +221 to +228
/// Converts an [`ArrayRef`] to a [`ColumnarValue`] based on the supplied arguments.
/// This is useful for scalar UDF implementations to fulfil their contract:
/// if all arguments are scalar values, the result should also be a scalar value.
pub fn from_args_and_result(args: &[Self], result: ArrayRef) -> Result<Self> {
if result.len() == 1 && args.iter().all(|arg| matches!(arg, Self::Scalar(_))) {
Ok(Self::Scalar(ScalarValue::try_from_array(&result, 0)?))
} else {
Ok(Self::Array(result))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow such code to exist, wouldn't it be better to just have it during constant-folding, rather than inside function implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already exists in constant folding, but we shouldn't rely on constant folding for the correct function implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends how we define "correct".
IF constant folding is the only party where maintaining ColumnarValue::Scalar matters, THEN we can redefine "correct", loosening the contract.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the only place where it matters. If for some reason the expression is not constant folded, you will get an error that the array lengths of different columns don't match.

@jonahgao
Copy link
Member

Thanks @joroKr21 Let's merge it first, and then consider similar problem for other scalar functions.

@jonahgao jonahgao merged commit 399e840 into apache:main Oct 16, 2024
25 checks passed
@joroKr21 joroKr21 deleted the math-udfs/scalar-value branch October 16, 2024 06:05
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 16, 2024
This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 16, 2024
This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 17, 2024
This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
alamb pushed a commit that referenced this pull request Oct 17, 2024
This was done in #12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 17, 2024
…2965)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Oct 19, 2024
…2965) (#276)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
avantgardnerio pushed a commit to coralogix/arrow-datafusion that referenced this pull request Sep 17, 2025
…2965) (#276)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
avantgardnerio pushed a commit to coralogix/arrow-datafusion that referenced this pull request Sep 18, 2025
…2965) (#276)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
avantgardnerio pushed a commit to coralogix/arrow-datafusion that referenced this pull request Sep 18, 2025
…2965) (#276)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
avantgardnerio pushed a commit to coralogix/arrow-datafusion that referenced this pull request Sep 18, 2025
…2965) (#276)

This was done in apache#12922 only for math functions.
We now generalize this fallback to all scalar UDFs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants