move make_array array_append array_prepend array_concat function to datafusion-functions-array crate #9343

guojidan · 2024-02-26T09:11:41Z

Which issue does this PR close?

Closes #9322.

Rationale for this change

What changes are included in this PR?

Moves the make_array, array_append, array_prepend, and array_concat functions to the datafusion-functions-array crate.

Are these changes tested?

yes

Are there any user-facing changes?

no

datafusion/physical-expr/src/functions.rs

jayzhan211 · 2024-02-27T13:30:20Z

datafusion/functions-array/src/macros.rs

@@ -45,14 +45,14 @@
 ///
 /// [`ScalarUDFImpl`]: datafusion_expr::ScalarUDFImpl
 macro_rules! make_udf_function {
-    ($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
+    ($UDF:ty, $EXPR_FN:ident, $arg:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {


Why was this changed to a Vec? The previous syntax could parse an arbitrary number of args.

The previous syntax cannot handle functions that take an indefinite number of args, such as ArrayConcat or MakeArray.
We also don't care about the arg names: ScalarFunction::args only needs a Vec, so I think changing to a Vec is reasonable.

I see.

Can we have another macro that handles functions with arbitrary args, like what we had before?

macro_rules! scalar_expr {
    ($ENUM:ident, $FUNC:ident, $($arg:ident)*, $DOC:expr) => {
        #[doc = $DOC]
        pub fn $FUNC($($arg: Expr),*) -> Expr {
            Expr::ScalarFunction(ScalarFunction::new(
                built_in_function::BuiltinScalarFunction::$ENUM,
                vec![$($arg),*],
            ))
        }
    };
}

macro_rules! nary_scalar_expr {
    ($ENUM:ident, $FUNC:ident, $DOC:expr) => {
        #[doc = $DOC ]
        pub fn $FUNC(args: Vec<Expr>) -> Expr {
            Expr::ScalarFunction(ScalarFunction::new(
                built_in_function::BuiltinScalarFunction::$ENUM,
                args,
            ))
        }
    };
}

That way we can avoid wrapping a single element in vec![], which is quite nice.

How about this? I tested it and it works well:

macro_rules! make_udf_function {
    ($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
        paste::paste! {
            // "fluent expr_fn" style function
            #[doc = $DOC]
            pub fn $EXPR_FN($($arg: Expr),*) -> Expr {
                Expr::ScalarFunction(ScalarFunction::new_udf(
                    $SCALAR_UDF_FN(),
                    vec![$($arg),*],
                ))
            }

            /// Singleton instance of [`$UDF`], ensures the UDF is only created once
            /// named STATIC_$(UDF). For example `STATIC_ArrayToString`
            #[allow(non_upper_case_globals)]
            static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
                std::sync::OnceLock::new();

            /// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
            ///
            /// [`ScalarUDF`]: datafusion_expr::ScalarUDF
            pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
                [< STATIC_ $UDF >]
                    .get_or_init(|| {
                        std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
                            <$UDF>::new(),
                        ))
                    })
                    .clone()
            }
        }
    };
    ($UDF:ty, $EXPR_FN:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
        paste::paste! {
            // "fluent expr_fn" style function
            #[doc = $DOC]
            pub fn $EXPR_FN(arg: Vec<Expr>) -> Expr {
                Expr::ScalarFunction(ScalarFunction::new_udf(
                    $SCALAR_UDF_FN(),
                    arg,
                ))
            }

            /// Singleton instance of [`$UDF`], ensures the UDF is only created once
            /// named STATIC_$(UDF). For example `STATIC_ArrayToString`
            #[allow(non_upper_case_globals)]
            static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
                std::sync::OnceLock::new();

            /// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
            ///
            /// [`ScalarUDF`]: datafusion_expr::ScalarUDF
            pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
                [< STATIC_ $UDF >]
                    .get_or_init(|| {
                        std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
                            <$UDF>::new(),
                        ))
                    })
                    .clone()
            }
        }
    };
}
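The two-arm idea in this thread can be tried in isolation. The following is a minimal sketch that uses only the standard library: the `Expr` enum and the `make_expr_fn!` macro are toy stand-ins invented for illustration, not DataFusion types. The first arm takes a space-separated list of argument names and generates a fixed-arity function; the second arm has no argument list and generates a function that takes a `Vec<Expr>`.

```rust
/// Toy stand-in for an expression tree; this is not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Call { name: &'static str, args: Vec<Expr> },
}

/// Hypothetical macro with one arm per calling convention.
macro_rules! make_expr_fn {
    // Fixed arity: each identifier becomes a named `Expr` parameter.
    ($FUNC:ident, $($arg:ident)*, $DOC:expr) => {
        #[doc = $DOC]
        fn $FUNC($($arg: Expr),*) -> Expr {
            Expr::Call { name: stringify!($FUNC), args: vec![$($arg),*] }
        }
    };
    // Variadic: the caller passes the whole argument list as a `Vec<Expr>`.
    ($FUNC:ident, $DOC:expr) => {
        #[doc = $DOC]
        fn $FUNC(args: Vec<Expr>) -> Expr {
            Expr::Call { name: stringify!($FUNC), args }
        }
    };
}

// Matches the first arm: two named arguments.
make_expr_fn!(array_append, array element, "appends an element to an array");
// Matches the second arm: no argument names, so the function takes a Vec.
make_expr_fn!(make_array, "builds an array from any number of expressions");
```

With these definitions, `array_append(a, b)` takes two expressions directly, while `make_array(vec![a, b, c])` accepts any number of them, so single-argument functions never need to be wrapped in `vec![]` at the call site.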

datafusion/optimizer/src/analyzer/rewrite_expr.rs

datafusion/proto/tests/cases/roundtrip_logical_plan.rs

datafusion/functions-array/src/udf.rs

jayzhan211 · 2024-02-28T09:27:01Z

datafusion/functions-array/src/udf.rs

+    }
+
+    fn invoke(&self, args: &[ColumnarValue]) -> datafusion_common::Result<ColumnarValue> {
+        make_scalar_function_with_hints(crate::kernels::array_append)(args)


I think we don't need this. I played around with it and found that there is an error in

https://github.com/apache/arrow-datafusion/blob/32d906fc9622af3a67b3828700272092fe0982a0/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L526-L530

where LargeList casting panics.

If the list casting is fixed, we probably don't need make_scalar_function_with_hints anymore.

Yes. Without make_scalar_function_with_hints, arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs panics, because arrow-rs does not support casting LargeList. So at present we still need make_scalar_function_with_hints.

We can use as_list::<i64> for LargeList and as_list::<i32> for List.
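To illustrate why one generic kernel can serve both list types, here is a small standard-library model. It assumes only that the two list types differ in the width of their offsets (i32 versus i64); the `Offset` trait, `ToyList`, and `append_to_each` are hypothetical names for this sketch and are not the arrow-rs API.

```rust
/// Offset widths for list-like arrays: 32-bit for List, 64-bit for LargeList.
/// Toy trait for this sketch; not the arrow-rs trait.
trait Offset: Copy {
    fn to_usize(self) -> usize;
}

impl Offset for i32 {
    fn to_usize(self) -> usize {
        self as usize
    }
}

impl Offset for i64 {
    fn to_usize(self) -> usize {
        self as usize
    }
}

/// Toy list array: row `i` spans `values[offsets[i]..offsets[i + 1]]`.
struct ToyList<O: Offset> {
    offsets: Vec<O>,
    values: Vec<i64>,
}

/// One generic kernel handles both offset widths: it appends `element`
/// to every row and returns the rows as plain vectors.
fn append_to_each<O: Offset>(list: &ToyList<O>, element: i64) -> Vec<Vec<i64>> {
    list.offsets
        .windows(2)
        .map(|w| {
            let mut row = list.values[w[0].to_usize()..w[1].to_usize()].to_vec();
            row.push(element);
            row
        })
        .collect()
}
```

The caller picks the offset type once (for example `ToyList::<i64>` for the large variant), and the kernel body stays the same.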

But I think make_scalar_function_with_hints has clear logic: it determines whether the output is a Scalar or an Array. I want to keep it 🤔

I'm not sure whether there is any case where we need a Scalar here. If there is, we can definitely keep make_scalar_function_with_hints; otherwise, we can remove it and add it back when we need it.

Maybe @alamb can give some advice 😄
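The scalar-or-array decision discussed in this thread can be modelled without DataFusion. The sketch below is an assumption-laden illustration, not the real make_scalar_function_with_hints: `ColumnarValue` is a toy enum over `i64`, and `wrap_kernel` and `add_columns` are hypothetical helpers. The wrapper expands scalars into columns, runs an array-only kernel, and returns a scalar only when every input was a scalar.

```rust
/// Toy stand-in for DataFusion's `ColumnarValue`: a single value or a column.
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Scalar(i64),
    Array(Vec<i64>),
}

/// Wrap an array-only kernel so that it also accepts scalars (hypothetical helper).
fn wrap_kernel<F>(kernel: F) -> impl Fn(&[ColumnarValue]) -> ColumnarValue
where
    F: Fn(&[Vec<i64>]) -> Vec<i64>,
{
    move |args: &[ColumnarValue]| {
        // The row count comes from the first array argument, if there is one.
        let len = args.iter().find_map(|a| match a {
            ColumnarValue::Array(v) => Some(v.len()),
            ColumnarValue::Scalar(_) => None,
        });
        // Expand every scalar to a full column so the kernel sees only arrays.
        let columns: Vec<Vec<i64>> = args
            .iter()
            .map(|a| match a {
                ColumnarValue::Scalar(s) => vec![*s; len.unwrap_or(1)],
                ColumnarValue::Array(v) => v.clone(),
            })
            .collect();
        let result = kernel(&columns);
        match len {
            // At least one array input: return a column.
            Some(_) => ColumnarValue::Array(result),
            // Every input was a scalar: collapse the one-row result to a scalar.
            None => ColumnarValue::Scalar(result[0]),
        }
    }
}

/// Example kernel: element-wise sum of two equal-length columns.
fn add_columns(cols: &[Vec<i64>]) -> Vec<i64> {
    cols[0].iter().zip(&cols[1]).map(|(a, b)| a + b).collect()
}
```

Calling `wrap_kernel(add_columns)` on two scalars yields a scalar, while mixing a scalar with an array yields an array of the array's length.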

guojidan · 2024-02-28T15:02:24Z

@jayzhan211 thank you very much for your review 😄

alamb

Thank you @guojidan and @jayzhan211 for the review. My only concern is the newly introduced dependency between datafusion-optimizer and datafusion-functions-array. I wonder if we can find a way to avoid that dependency (maybe we could move the array-specific rewrites into their own pass somehow 🤔).

alamb · 2024-02-28T18:55:23Z

datafusion/proto/tests/cases/roundtrip_logical_plan.rs

@@ -578,7 +578,14 @@ async fn roundtrip_expr_api() -> Result<()> {
    let expr_list = vec![
        encode(col("a").cast_to(&DataType::Utf8, &schema)?, lit("hex")),
        decode(lit("1234"), lit("hex")),
-        array_to_string(array(vec![lit(1), lit(2), lit(3)]), lit(",")),
+        array_to_string(make_array(vec![lit(1), lit(2), lit(3)]), lit(",")),


alamb · 2024-02-28T18:57:32Z

datafusion/optimizer/Cargo.toml

@@ -44,6 +44,7 @@ async-trait = { workspace = true }
 chrono = { workspace = true }
 datafusion-common = { workspace = true }
 datafusion-expr = { workspace = true }
+datafusion-functions-array = { workspace = true }


I am not sure about this new dependency -- it means that using the optimizer will require bringing in the physical exprs, etc and users (like dask-sql) that only want the planner code would bring in substantial unused code.

I realize the reason this is required is to preserve the semantics of the existing rewrite pass in datafusion/optimizer/src/analyzer/rewrite_expr.rs. I wonder if we can somehow avoid adding this new dependency 🤔

Maybe we can split each UDF into a logical expr and a physical expr, and conditionally export only the logical expr (with #[cfg(feature = "array_expressions")]) to the optimizer. That way we can still optimize UDFs, while users import only what they need.

Is it possible to avoid importing function-arrays if we get udf via self.context_provider.get_function_meta(&name) ?

Wow, this may be feasible.
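The name-based lookup suggested above could look roughly like the following sketch. Everything here is a toy written against the standard library: `ScalarUdf`, `Registry`, and `plan_array_literal` are hypothetical, and the `ContextProvider` trait is a one-method stand-in for the real trait. The point is that the planning code depends only on the trait and a function name, not on the crate that defines make_array.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Toy stand-in for a registered scalar UDF.
#[derive(Debug)]
struct ScalarUdf {
    name: String,
}

/// Toy provider that resolves functions by name at planning time.
trait ContextProvider {
    fn get_function_meta(&self, name: &str) -> Option<Arc<ScalarUdf>>;
}

/// Hypothetical registry; an array-functions crate would register its UDFs here.
struct Registry {
    functions: HashMap<String, Arc<ScalarUdf>>,
}

impl Registry {
    fn new() -> Self {
        Registry { functions: HashMap::new() }
    }

    fn register(&mut self, name: &str) {
        let udf = Arc::new(ScalarUdf { name: name.to_string() });
        self.functions.insert(name.to_string(), udf);
    }
}

impl ContextProvider for Registry {
    fn get_function_meta(&self, name: &str) -> Option<Arc<ScalarUdf>> {
        self.functions.get(name).cloned()
    }
}

/// Planner-side code: it knows the function only by name, so it needs no
/// compile-time dependency on the crate that implements `make_array`.
fn plan_array_literal(
    provider: &dyn ContextProvider,
    n_elements: usize,
) -> Result<String, String> {
    let udf = provider
        .get_function_meta("make_array")
        .ok_or_else(|| "function make_array is not registered".to_string())?;
    Ok(format!("{}({} args)", udf.name, n_elements))
}
```

If the function has not been registered, planning fails with an error at run time instead of failing to compile, which is the trade-off of looking functions up by name.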

Hi @jayzhan211, do you think my current approach to the analyzer is feasible? I moved the array analyzer to the datafusion-functions-array crate and add it to the analyzer rules if the array_expression feature is enabled.

Probably not a good idea; it seems like duplicated code without any good reason.

@guojidan
Let's see if it is possible to optionally import datafusion-functions-array; that seems easier than bringing UDFs into the optimizer.
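An optional import of this kind is usually expressed with a Cargo feature and `#[cfg]`. The sketch below shows only the Rust side, under stated assumptions: the rule names are placeholders for illustration, the function returns names instead of real rule objects, and the feature name `array_expressions` is taken from the comments in this thread.

```rust
/// Names of the analyzer rules to run, in order (toy stand-in for rule objects).
#[allow(unused_mut)]
fn analyzer_rules() -> Vec<&'static str> {
    let mut rules = vec!["inline_table_scan", "type_coercion"];
    // This statement is compiled only when the crate is built with
    // `--features array_expressions`; otherwise it is removed entirely,
    // along with any dependency it would reference.
    #[cfg(feature = "array_expressions")]
    rules.push("array_operator_rewrite"); // placeholder rule name
    rules
}
```

Built without the feature, the function returns the two base rules; built with it, the array rule is appended. In a real crate the dependency itself would also be marked optional in Cargo.toml and tied to the same feature.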

guojidan · 2024-02-29T07:13:34Z

Hi @jayzhan211, if make_array is conditionally exported, how should I deal with this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133? Thank you.

jayzhan211 · 2024-02-29T07:20:06Z

I think it depends on whether we place make_array in the additional array category or in the core function category.
If we consider make_array to be supported only when the array_expression cfg is set, then we also support array literals only when the flag is set; otherwise we support them by default.

guojidan · 2024-02-29T07:29:59Z

yes, I think make_array should be a core function

jayzhan211 · 2024-02-29T07:37:53Z

#9100 (comment)
Edit: Let's keep it in functions-array for now, unless there is a good reason to move it.

guojidan · 2024-02-29T08:05:26Z

As you said, we can use self.context_provider.get_function_meta("make_array") instead.

jayzhan211 · 2024-03-02T05:23:20Z

I converted this to a draft. Feel free to ping me when the PR is ready for review.

jayzhan211 · 2024-03-02T05:24:24Z

Remember to replace them in the rewriter too.

guojidan · 2024-03-06T09:36:00Z

Hi @jayzhan211, sorry for the late reply. This PR's to-do list is:

- Don't move the rewrite_expr.rs file into the datafusion-functions-array crate.
- Use self.context_provider.get_function_meta("make_array") to replace make_array() and the similar calls in rewrite_expr.rs.

Is there anything else to add?

jayzhan211 · 2024-03-06T09:43:57Z

Nope

guojidan · 2024-03-07T08:55:11Z

This is difficult to implement: the Analyzer struct has no dyn ContextProvider member, and I don't know how to add one. Can you give me some tips? @jayzhan211

guojidan · 2024-03-08T10:08:58Z

Because this PR has been open for too long and is difficult to rebase or merge, I opened a new PR, #9504, and will close this one.

github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), physical-expr (Physical Expressions), optimizer (Optimizer rules), and core (Core DataFusion crate) labels Feb 26, 2024

guojidan force-pushed the move-array branch from aadb7e5 to 44ee67f, February 27, 2024 02:12

guojidan mentioned this pull request Feb 27, 2024

[Epic] Port BuiltInFunctons to datafusion-functions-* crates #9285

Closed

43 tasks

guojidan force-pushed the move-array branch from ed0a106 to 32cc0a5, February 27, 2024 12:50

jayzhan211 reviewed Feb 27, 2024

datafusion/proto/tests/cases/roundtrip_logical_plan.rs (outdated)

jayzhan211 reviewed Feb 27, 2024

datafusion/functions-array/src/udf.rs (outdated)

guojidan force-pushed the move-array branch from 32cc0a5 to 8f16593, February 28, 2024 09:20

jayzhan211 reviewed Feb 28, 2024

github-actions bot removed the core (Core DataFusion crate) label Feb 28, 2024

guojidan added 8 commits February 28, 2024 11:31

move array function

cdb1bf1

fix proto file

3db0b66

regen proto

c5afab9

fix cli Cargo.lock

ad55448

cargo fmt

17d4893

fix some logical && add test case

c9d3b20

fix rebase err

1834bde

optimize macros

c09e656

guojidan force-pushed the move-array branch from 1ed2488 to c09e656, February 28, 2024 11:32

alamb mentioned this pull request Feb 28, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 26, 2024 #9345

Closed

9 tasks

alamb reviewed Feb 28, 2024

Separate the optimizer for arrays

87d9839

github-actions bot added the core (Core DataFusion crate) label Feb 29, 2024

guojidan added 2 commits February 29, 2024 03:29

Merge remote-tracking branch 'origin/main' into move-array

c6fde09

fix merge err

1671dac

fmt && clippy

c11ad0e

fix circular dependency && featrue

a1f91b6

jayzhan211 marked this pull request as draft March 2, 2024 05:21

guojidan mentioned this pull request Mar 8, 2024

move make_array array_append array_prepend array_concat function to datafusion-functions-array crate #9504

Merged

guojidan closed this Mar 8, 2024
