-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Ensure that math functions fulfil the ColumnarValue contract #12922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If all UDF arguments are scalars, so should be the result. In most cases, such function calls will be contant-folded, however if for whatever reason the are not optimized, we want to avoid an error due to array length mismatch.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
| fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
| let args = ColumnarValue::values_to_arrays(args)?; | ||
|
|
||
| fn invoke(&self, col_args: &[ColumnarValue]) -> Result<ColumnarValue> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that @simonvandel and @jonahgao have been removing use of make_math_unary_udf in general
(or example #12889)
We should probably make sure we aren't missing a similar problem in their reviews
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to add an attribute like maintains_scalar to ScalarUDFImpl and then automatically enforce this process during evaluation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not necessarily true for volatile functions. But I wonder then how will the function know the array length? E.g. for a rand(max) function if I call it like rand(42) which is pretty reasonable, how does it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a function needs to maintain scalar, we can call ColumnarValue::from_args_and_result after invoking it.
// in file: datafusion/physical-expr/src/scalar_function.rs
// evaluate the function
let output = match self.args.is_empty() {
true => self.fun.invoke_no_args(batch.num_rows()),
false => self.fun.invoke(&inputs),
}?;
if let ColumnarValue::Array(array) = &output {
if self.fun.need_maintain_scalar() && !inputs.is_empty() {
output = ColumnarValue::from_args_and_result(inputs, array)
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that if the function is volatile but it has only scalar arguments, how does it know how many rows to produce?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah anyway, I was just trying to show that this is a deficiency of the UDF model. Ideally, invoke should also get the number of expected rows and then there would also be no need for a special invoke_no_args.
Regarding adding a flag with default false, that doesn't sound very useful. Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally,
invokeshould also get the number of expected rows and then there would also be no need for a specialinvoke_no_args.
Make sense to me. This makes it easy to support rand_int. It seems that currently all functions should return a scalar value when all their arguments are scalar, for example, math functions that do not use make_math_*_udf.
Maybe we can just assume that if the result of any immutable function is an array of size one, we might as well convert it to a scalar value.
This can be a general solution. The disadvantage is that if the arguments are not real scalar, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a general solution. The disadvantage is that if the arguments are not real scalar, it will cause unnecessary overhead, as the final result of queries will still need to be converted back into arrays.
Yeah, but it sounds very unlikely that we will have a batch of one row. In any case, we can do the same check that all arguments are scalar 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but it sounds very unlikely that we will have a batch of one row.
it's unlikely, but you must not assume that.
a trivial example is a source table with one row.
we can do the same check that all arguments are scalar 👍
👍
| /// Converts an [`ArrayRef`] to a [`ColumnarValue`] based on the supplied arguments. | ||
| /// This is useful for scalar UDF implementations to fulfil their contract: | ||
| /// if all arguments are scalar values, the result should also be a scalar value. | ||
| pub fn from_args_and_result(args: &[Self], result: ArrayRef) -> Result<Self> { | ||
| if result.len() == 1 && args.iter().all(|arg| matches!(arg, Self::Scalar(_))) { | ||
| Ok(Self::Scalar(ScalarValue::try_from_array(&result, 0)?)) | ||
| } else { | ||
| Ok(Self::Array(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow such code to exist, wouldn't it be better to just have it during constant-folding, rather than inside function implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already exists in constant folding, but we shouldn't rely on constant folding for the correct function implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That depends how we define "correct".
IF constant folding is the only party where maintaining ColumnarValue::Scalar matters, THEN we can redefine "correct", loosening the contract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the only place where it matters. If for some reason the expression is not constant folded, you will get an error that the array lengths of different columns don't match.
|
Thanks @joroKr21 Let's merge it first, and then consider similar problem for other scalar functions. |
This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
This was done in #12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) (#276) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) (#276) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) (#276) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) (#276) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
…2965) (#276) This was done in apache#12922 only for math functions. We now generalize this fallback to all scalar UDFs.
If all UDF arguments are scalars, so should be the result. In most cases, such function calls will be contant-folded, however, if for whatever reason the are not optimized, we want to avoid an error due to array length mismatch.
Which issue does this PR close?
No particular issue, as I couldn't find a case that is not constant-folded with pure DataFusion, however we have some custom optimizer rules that might result in expressions that are not constant-folded.
Rationale for this change
The rationale is that the UDF implementations shouldn't rely on optimizer rules to work correctly.
What changes are included in this PR?
ColumnarValue::from_args_and_resultto perform the check and conversion.Are these changes tested?
Relying on existing tests to ensure nothing breaks.
Are there any user-facing changes?
I made
ColumnarValue::from_args_and_resultpublic but if that's not desirable, I can move it to the math module.