Fix functions with Volatility::Volatile and parameters #13001
ff4b90c to 4c8ebc4
alamb
left a comment
Thank you @agscpp -- this makes a lot of sense and I think is quite close. I left some API comments for your consideration. Let me know!
datafusion/expr/src/udf.rs
Outdated
| not_impl_err!( | ||
| "Function {} does not implement invoke_batch but called", | ||
| self.name() | ||
| ) |
🤔 It seems like the ideal outcome would be for all ScalarUDFs to implement this method (as it covers invoke, invoke_no_args as well).
Would you be open to changing this so it uses a default implementation like this?
| not_impl_err!( | |
| "Function {} does not implement invoke_batch but called", | |
| self.name() | |
| ) | |
| if args.is_empty() { | |
| self.invoke_no_args(number_rows) | |
| } else { | |
| self.invoke(args) | |
| } |
Then the function implementation could decide what to do with that information
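A minimal sketch of the default-implementation idea suggested above. It uses a simplified stand-in trait rather than the real ScalarUDFImpl: the trait `Udf`, the `Value` alias, and both implementors are hypothetical and only illustrate the delegation.

```rust
// Simplified stand-in for ColumnarValue; the real type lives in DataFusion.
type Value = Vec<String>;

// Hypothetical trait modelling the three entry points discussed above.
trait Udf {
    fn invoke(&self, _args: &[Value]) -> Result<Value, String> {
        Err("invoke is not implemented".to_string())
    }

    fn invoke_no_args(&self, _number_rows: usize) -> Result<Value, String> {
        Err("invoke_no_args is not implemented".to_string())
    }

    // Default: delegate to the older methods so existing functions keep working.
    fn invoke_batch(&self, args: &[Value], number_rows: usize) -> Result<Value, String> {
        if args.is_empty() {
            self.invoke_no_args(number_rows)
        } else {
            self.invoke(args)
        }
    }
}

// Legacy function: implements only `invoke`, reached through the default.
struct Upper;
impl Udf for Upper {
    fn invoke(&self, args: &[Value]) -> Result<Value, String> {
        Ok(args[0].iter().map(|s| s.to_uppercase()).collect())
    }
}

// New-style function: overrides `invoke_batch` and uses the row count directly.
struct RowNumber;
impl Udf for RowNumber {
    fn invoke_batch(&self, _args: &[Value], number_rows: usize) -> Result<Value, String> {
        Ok((1..=number_rows).map(|i| i.to_string()).collect())
    }
}
```

With this shape a caller only ever needs `invoke_batch`: `Upper` falls through to its `invoke`, while `RowNumber` sees both the arguments and the number of rows.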
done
| "+-----------+", // | ||
| "| str |", // | ||
| "+-----------+", // | ||
| "| 1. test_1 |", // |
what is the meaning of the trailing // ?
Also, the indexes repeat (several rows start with 1), which seems to imply invoke is run multiple times. Perhaps we could set target_partitions to 1 on the SessionContext so the data isn't repartitioned?
I myself did not understand what the // are for, but I left it because this style was used above.
The invoke_batch method is called once. Within the function itself, the index is counted separately for each distinct row value, so the numbering restarts for each string:
1. test_1
2. test_1
3. test_1
4. test_1
1. test_2
2. test_2
3. test_2
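One possible reading of the numbering described in this reply, sketched with plain string slices instead of Arrow arrays. The free function `add_index_to_string` and its signature are hypothetical; only the per-value counter is the point.

```rust
use std::collections::HashMap;

// Hypothetical helper: prefixes each row with a running index that is
// tracked separately for every distinct input value.
fn add_index_to_string(rows: &[&str]) -> Vec<String> {
    let mut counters: HashMap<&str, usize> = HashMap::new();
    rows.iter()
        .map(|&value| {
            // Bump the counter that belongs to this particular value.
            let count = counters.entry(value).or_insert(0);
            *count += 1;
            format!("{}. {}", *count, value)
        })
        .collect()
}
```

Because the counter is keyed by the row contents, a single call over the whole batch still yields several rows that start with 1, one per distinct value.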
datafusion/core/tests/user_defined/user_defined_scalar_functions.rs
Outdated
| let output = match self.args.is_empty() { | ||
| true => self.fun.invoke_no_args(batch.num_rows()), | ||
| false => self.fun.invoke(&inputs), | ||
| false => match self.fun.signature().volatility { |
If you modified invoke_batch as above, we could change this code to simply call self.fun.invoke_batch() always
4c8ebc4 to 1e6ad19
alamb
left a comment
Thank you @agscpp -- this looks great to me
| /// | ||
| /// [invoke_no_args]: ScalarUDFImpl::invoke_no_args | ||
| fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue>; | ||
| fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> { |
Nice! -- as a follow on PR I think we should deprecate the other two functions (invoke_no_args and invoke) telling people to use invoke instead
Is this ok with you @jayzhan211 ?
as a follow on PR I think we should deprecate the other two functions (invoke_no_args and invoke) telling people to use invoke instead
did you mean invoke_batch?
yes, it would be great to have only one invoke entry-point
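A rough sketch of what such a follow-up deprecation could look like, using Rust's built-in `#[deprecated]` attribute. The trait `Udf`, the `i64` argument type, and the note text are all hypothetical and are not taken from DataFusion.

```rust
// Hypothetical trait; the method names mirror the discussion only.
trait Udf {
    #[deprecated(note = "implement and call `invoke_batch` instead")]
    fn invoke(&self, _args: &[i64]) -> Vec<i64> {
        unimplemented!("invoke is not implemented")
    }

    #[deprecated(note = "implement and call `invoke_batch` instead")]
    fn invoke_no_args(&self, _number_rows: usize) -> Vec<i64> {
        unimplemented!("invoke_no_args is not implemented")
    }

    // Single entry point. The default still forwards to the deprecated
    // methods so existing implementations keep working during migration.
    #[allow(deprecated)]
    fn invoke_batch(&self, args: &[i64], number_rows: usize) -> Vec<i64> {
        if args.is_empty() {
            self.invoke_no_args(number_rows)
        } else {
            self.invoke(args)
        }
    }
}

// Legacy implementor that overrides only the deprecated `invoke`.
struct Double;

#[allow(deprecated)]
impl Udf for Double {
    fn invoke(&self, args: &[i64]) -> Vec<i64> {
        args.iter().map(|v| v * 2).collect()
    }
}
```

Callers that still use `invoke` or `invoke_no_args` directly would get a compiler warning pointing them at `invoke_batch`, while calls through `invoke_batch` stay warning-free.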
| assert_batches_eq!(expected, &result); | ||
|
|
||
| let result = | ||
| plan_and_collect(&ctx, "select add_index_to_string('test') AS str from t") // with fixed function parameters |
nice
datafusion/expr/src/udf.rs
Outdated
| _args: &[ColumnarValue], | ||
| _number_rows: usize, |
These are not unused, let's remove leading _ from arg names
datafusion/expr/src/udf.rs
Outdated
| /// The function should be used for signatures with [`datafusion_expr_common::signature::Volatility::Volatile`] | ||
| /// and with arguments. |
Only for these?
if yes => the function should be named appropriately (invoke_volatile)
if no => remove the comment
I removed this comment. This function is always called in the current implementation.
| } | ||
| { |
Nit: consider separating test cases into separate test functions; this would give them descriptive names
These tests are very similar, so they don't break down into two parts well.
| /// Volatile UDF that should be append a different value to each row | ||
| struct AddIndexToStringScalarUDF { |
The volatility is important, let's reflect it in the function name -- it's this function's main purpose, not an attribute it happens to have.
AddIndexToStringVolatileScalarUDF
thank you!
| Ok(()) | ||
| } | ||
|
|
||
| /// Volatile UDF that should be append a different value to each row |
| /// Volatile UDF that should be append a different value to each row | |
| /// Volatile UDF that should append a different value to each row | |
| #[derive(Debug)] |
| impl std::fmt::Debug for AddIndexToStringScalarUDF { | ||
| fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { | ||
| f.debug_struct("ScalarUDF") | ||
| .field("name", &self.name) | ||
| .field("signature", &self.signature) | ||
| .field("fun", &"<FUNC>") | ||
| .finish() | ||
| } | ||
| } | ||
|
|
Leverage #[derive(Debug)] to ensure all fields are part of the Debug output.
| impl std::fmt::Debug for AddIndexToStringScalarUDF { | |
| fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { | |
| f.debug_struct("ScalarUDF") | |
| .field("name", &self.name) | |
| .field("signature", &self.signature) | |
| .field("fun", &"<FUNC>") | |
| .finish() | |
| } | |
| } |
| _ => unimplemented!(), | ||
| }; | ||
| Ok(ColumnarValue::Array( | ||
| Arc::new(StringArray::from(answer)) as ArrayRef |
| Arc::new(StringArray::from(answer)) as ArrayRef | |
| Arc::new(StringArray::from(answer)) |
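The cast can go because Rust coerces `Arc<T>` to `Arc<dyn Trait>` wherever the expected type is already known, such as an enum variant's field. A std-only sketch of the same pattern, with `DebugRef` and `Wrapper` as hypothetical stand-ins for `ArrayRef` and `ColumnarValue`:

```rust
use std::fmt::Debug;
use std::sync::Arc;

// Hypothetical stand-in for `ArrayRef`: a reference-counted trait object.
type DebugRef = Arc<dyn Debug>;

// Hypothetical stand-in for `ColumnarValue::Array`.
enum Wrapper {
    Array(DebugRef),
}

fn wrap(values: Vec<String>) -> Wrapper {
    // No `as DebugRef` needed: the variant's field type drives the
    // unsized coercion from Arc<Vec<String>> to Arc<dyn Debug>.
    Wrapper::Array(Arc::new(values))
}
```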
| } | ||
| _ => unimplemented!(), | ||
| }; | ||
| Ok(ColumnarValue::Array( |
Is it OK for this function to return an array even when it's invoked with only ColumnarValue::Scalar arguments?
Yes, that's fine. If you return a ColumnarValue::Scalar, all rows will have the same result.
I was thinking about #12922
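The scalar-versus-array point in this thread can be sketched with a toy enum. This `ColumnarValue` is a simplified, strings-only stand-in, and `into_rows` and `add_index` are hypothetical helpers, not DataFusion functions.

```rust
// Simplified, strings-only stand-in for DataFusion's ColumnarValue.
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Scalar(String),
    Array(Vec<String>),
}

impl ColumnarValue {
    // Expand to one value per row: a scalar is repeated, an array is used as-is.
    fn into_rows(self, number_rows: usize) -> Vec<String> {
        match self {
            ColumnarValue::Scalar(value) => vec![value; number_rows],
            ColumnarValue::Array(values) => values,
        }
    }
}

// Volatile-style function: even for a scalar argument it returns an array,
// because a scalar result would give every row the same value.
fn add_index(arg: &ColumnarValue, number_rows: usize) -> ColumnarValue {
    let values: Vec<String> = match arg {
        ColumnarValue::Scalar(value) => vec![value.clone(); number_rows],
        ColumnarValue::Array(values) => values.clone(),
    };
    ColumnarValue::Array(
        values
            .iter()
            .enumerate()
            .map(|(i, v)| format!("{}. {}", i + 1, v))
            .collect(),
    )
}
```

Expanding the untouched scalar repeats one value across all rows, whereas the array returned by `add_index` carries a distinct value per row, which is the behaviour a volatile function needs.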
1e6ad19 to 2c68fc9
Co-authored-by: Agaev Huseyn <h.agaev@vkteam.ru>
Which issue does this PR close?
Closes #13000.
Rationale for this change
These changes make it possible to implement functions that produce a unique result on each call, even when given the same input.