-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move make_array array_append array_prepend array_concat function to datafusion-functions-array crate #9343
Conversation
@@ -45,14 +45,14 @@ | |||
/// | |||
/// [`ScalarUDFImpl`]: datafusion_expr::ScalarUDFImpl | |||
macro_rules! make_udf_function { | |||
($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => { | |||
($UDF:ty, $EXPR_FN:ident, $arg:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this change to vec? previous syntax is able to parse arbitrary args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For function with an indefinite number of args, like ArrayConcat
or MakeArray
, previous syntax unable to handle well,
And we don't care about args name, ScalarFunction::args just need an Vec, So I think change to vec is reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Can we have another macro handle arbitrary args function?
like what we have before
macro_rules! scalar_expr {
($ENUM:ident, $FUNC:ident, $($arg:ident)*, $DOC:expr) => {
#[doc = $DOC]
pub fn $FUNC($($arg: Expr),*) -> Expr {
Expr::ScalarFunction(ScalarFunction::new(
built_in_function::BuiltinScalarFunction::$ENUM,
vec![$($arg),*],
))
}
};
}
macro_rules! nary_scalar_expr {
($ENUM:ident, $FUNC:ident, $DOC:expr) => {
#[doc = $DOC ]
pub fn $FUNC(args: Vec<Expr>) -> Expr {
Expr::ScalarFunction(ScalarFunction::new(
built_in_function::BuiltinScalarFunction::$ENUM,
args,
))
}
};
}
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid wrapping single element into vec![], which is quite nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about this, I am tested, it work well :
macro_rules! make_udf_function {
($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
paste::paste! {
// "fluent expr_fn" style function
#[doc = $DOC]
pub fn $EXPR_FN($($arg: Expr),*) -> Expr {
Expr::ScalarFunction(ScalarFunction::new_udf(
$SCALAR_UDF_FN(),
vec![$($arg),*],
))
}
/// Singleton instance of [`$UDF`], ensures the UDF is only created once
/// named STATIC_$(UDF). For example `STATIC_ArrayToString`
#[allow(non_upper_case_globals)]
static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
std::sync::OnceLock::new();
/// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
///
/// [`ScalarUDF`]: datafusion_expr::ScalarUDF
pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
[< STATIC_ $UDF >]
.get_or_init(|| {
std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
<$UDF>::new(),
))
})
.clone()
}
}
};
($UDF:ty, $EXPR_FN:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
paste::paste! {
// "fluent expr_fn" style function
#[doc = $DOC]
pub fn $EXPR_FN(arg: Vec<Expr>) -> Expr {
Expr::ScalarFunction(ScalarFunction::new_udf(
$SCALAR_UDF_FN(),
arg,
))
}
/// Singleton instance of [`$UDF`], ensures the UDF is only created once
/// named STATIC_$(UDF). For example `STATIC_ArrayToString`
#[allow(non_upper_case_globals)]
static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
std::sync::OnceLock::new();
/// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
///
/// [`ScalarUDF`]: datafusion_expr::ScalarUDF
pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
[< STATIC_ $UDF >]
.get_or_init(|| {
std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
<$UDF>::new(),
))
})
.clone()
}
}
};
}
} | ||
|
||
fn invoke(&self, args: &[ColumnarValue]) -> datafusion_common::Result<ColumnarValue> { | ||
make_scalar_function_with_hints(crate::kernels::array_append)(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we dont need this. I play around it and find that there is error in
LargeList casting panics
If the list casting is fixed, probably we dont need make_scalar_function_with_hints
anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, if no make_scalar_function_with_hints
function, arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs will pannic, cause by arrow-rs not support cast LargeList, So At present, we still need make_scalar_function_with_hints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use as_list::<i64>
for large list, and as_list::<i32>
for list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I think make_scalar_function_with_hints
function have clear logic, determine output is Scalar or Array, I want
to keep it 🤔
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is any case that we need Scalar here. If so, we definitely can keep make_scalar_function_with_hints
, otherwise, we can add it back until we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe @alamb can give some advise 😄
@jayzhan211 thank you very much for your review 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @guojidan and @jayzhan211 for the review. I am only concerned about the newly introduced dependnecy between datafusion-optimizer and datafusion-functions-array.... I wonder if we can find a way to avoid that dependency (maybe we could move the array specific rewrites into their own pass somehow 🤔 )
@@ -578,7 +578,14 @@ async fn roundtrip_expr_api() -> Result<()> { | |||
let expr_list = vec![ | |||
encode(col("a").cast_to(&DataType::Utf8, &schema)?, lit("hex")), | |||
decode(lit("1234"), lit("hex")), | |||
array_to_string(array(vec![lit(1), lit(2), lit(3)]), lit(",")), | |||
array_to_string(make_array(vec![lit(1), lit(2), lit(3)]), lit(",")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
datafusion/optimizer/Cargo.toml
Outdated
@@ -44,6 +44,7 @@ async-trait = { workspace = true } | |||
chrono = { workspace = true } | |||
datafusion-common = { workspace = true } | |||
datafusion-expr = { workspace = true } | |||
datafusion-functions-array = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this new dependency -- it means that using the optimizer will require bringing in the physical exprs, etc and users (like dask-sql) that only want the planner code would bring in substantial unused code.
I realize the reason this is required is to preserve the semantics of the existing rewrite pass in datafusion/optimizer/src/analyzer/rewrite_expr.rs
. I wonder if we can somehow avoid adding this new dependency 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can split udf to logical expr and physcial expr, and export only logical expr conditionally (with #[cfg(feature = "array_expressions")]
) to optimizer, so we can do optimizing for udfs, but only import the necessary one for user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to avoid importing function-arrays
if we get udf via self.context_provider.get_function_meta(&name)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to avoid importing
function-arrays
if we get udf viaself.context_provider.get_function_meta(&name)
?
wow, This may be feasible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @jayzhan211 , Do you think my current approach to the analyzer is feasible? I move array analyzer to datafusion-functions-array
crate, add array analyzer to analyzer rules
if array_expression
featrue is enable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a good idea, it seems like duplicated code without any good reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guojidan
let's see if it is possible to optionally import datafusion-functions-array
, it seems an easier way than bringing udfs to optimizer.
hi @jayzhan211 , if |
I think it depends on whether we should place |
yes, I think |
#9100 (comment) |
as you said, we can use |
I converted it to the draft, feel free to ping me when the PR is ready to review |
Remember to replace them in rewriter too |
hi @jayzhan211 , sorry long time to reply, this pr's todo list is:
|
Nope |
difficult to implement, |
because this pr lasting for too long,difficult to rebase or merge, so I open a new pr #9504 , I will close this pr |
Which issue does this PR close?
Closes #9322 .
Rationale for this change
What changes are included in this PR?
move make_array array_append array_prepend array_concat function to datafusion-functions-array crate
Are these changes tested?
yes
Are there any user-facing changes?
no