Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move make_array array_append array_prepend array_concat function to datafusion-functions-array crate #9343

Closed
wants to merge 13 commits into from

Conversation

guojidan
Copy link
Contributor

Which issue does this PR close?

Closes #9322 .

Rationale for this change

What changes are included in this PR?

move make_array array_append array_prepend array_concat function to datafusion-functions-array crate

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Feb 26, 2024
datafusion/physical-expr/src/functions.rs Outdated Show resolved Hide resolved
@@ -45,14 +45,14 @@
///
/// [`ScalarUDFImpl`]: datafusion_expr::ScalarUDFImpl
macro_rules! make_udf_function {
($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
($UDF:ty, $EXPR_FN:ident, $arg:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change to vec? previous syntax is able to parse arbitrary args

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For function with an indefinite number of args, like ArrayConcat or MakeArray, previous syntax unable to handle well,
And we don't care about args name, ScalarFunction::args just need an Vec, So I think change to vec is reasonable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

Can we have another macro handle arbitrary args function?

like what we have before

macro_rules! scalar_expr {
    ($ENUM:ident, $FUNC:ident, $($arg:ident)*, $DOC:expr) => {
        #[doc = $DOC]
        pub fn $FUNC($($arg: Expr),*) -> Expr {
            Expr::ScalarFunction(ScalarFunction::new(
                built_in_function::BuiltinScalarFunction::$ENUM,
                vec![$($arg),*],
            ))
        }
    };
}

macro_rules! nary_scalar_expr {
    ($ENUM:ident, $FUNC:ident, $DOC:expr) => {
        #[doc = $DOC ]
        pub fn $FUNC(args: Vec<Expr>) -> Expr {
            Expr::ScalarFunction(ScalarFunction::new(
                built_in_function::BuiltinScalarFunction::$ENUM,
                args,
            ))
        }
    };
}

This comment was marked as outdated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid wrapping single element into vec![], which is quite nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about this, I am tested, it work well :

macro_rules! make_udf_function {
    ($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
        paste::paste! {
            // "fluent expr_fn" style function
            #[doc = $DOC]
            pub fn $EXPR_FN($($arg: Expr),*) -> Expr {
                Expr::ScalarFunction(ScalarFunction::new_udf(
                    $SCALAR_UDF_FN(),
                    vec![$($arg),*],
                ))
            }

            /// Singleton instance of [`$UDF`], ensures the UDF is only created once
            /// named STATIC_$(UDF). For example `STATIC_ArrayToString`
            #[allow(non_upper_case_globals)]
            static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
                std::sync::OnceLock::new();

            /// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
            ///
            /// [`ScalarUDF`]: datafusion_expr::ScalarUDF
            pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
                [< STATIC_ $UDF >]
                    .get_or_init(|| {
                        std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
                            <$UDF>::new(),
                        ))
                    })
                    .clone()
            }
        }
    };
    ($UDF:ty, $EXPR_FN:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
        paste::paste! {
            // "fluent expr_fn" style function
            #[doc = $DOC]
            pub fn $EXPR_FN(arg: Vec<Expr>) -> Expr {
                Expr::ScalarFunction(ScalarFunction::new_udf(
                    $SCALAR_UDF_FN(),
                    arg,
                ))
            }

            /// Singleton instance of [`$UDF`], ensures the UDF is only created once
            /// named STATIC_$(UDF). For example `STATIC_ArrayToString`
            #[allow(non_upper_case_globals)]
            static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
                std::sync::OnceLock::new();

            /// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
            ///
            /// [`ScalarUDF`]: datafusion_expr::ScalarUDF
            pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
                [< STATIC_ $UDF >]
                    .get_or_init(|| {
                        std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
                            <$UDF>::new(),
                        ))
                    })
                    .clone()
            }
        }
    };
}

datafusion/optimizer/src/analyzer/rewrite_expr.rs Outdated Show resolved Hide resolved
}

fn invoke(&self, args: &[ColumnarValue]) -> datafusion_common::Result<ColumnarValue> {
make_scalar_function_with_hints(crate::kernels::array_append)(args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we dont need this. I play around it and find that there is error in

https://github.com/apache/arrow-datafusion/blob/32d906fc9622af3a67b3828700272092fe0982a0/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L526-L530

LargeList casting panics

If the list casting is fixed, probably we dont need make_scalar_function_with_hints anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if no make_scalar_function_with_hints function, arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs will pannic, cause by arrow-rs not support cast LargeList, So At present, we still need make_scalar_function_with_hints

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can use as_list::<i64> for large list, and as_list::<i32> for list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think make_scalar_function_with_hints function have clear logic, determine output is Scalar or Array, I want
to keep it 🤔

This comment was marked as outdated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is any case that we need Scalar here. If so, we definitely can keep make_scalar_function_with_hints , otherwise, we can add it back until we need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe @alamb can give some advise 😄

@github-actions github-actions bot removed the core Core DataFusion crate label Feb 28, 2024
@guojidan
Copy link
Contributor Author

@jayzhan211 thank you very much for your review 😄

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @guojidan and @jayzhan211 for the review. I am only concerned about the newly introduced dependnecy between datafusion-optimizer and datafusion-functions-array.... I wonder if we can find a way to avoid that dependency (maybe we could move the array specific rewrites into their own pass somehow 🤔 )

@@ -578,7 +578,14 @@ async fn roundtrip_expr_api() -> Result<()> {
let expr_list = vec![
encode(col("a").cast_to(&DataType::Utf8, &schema)?, lit("hex")),
decode(lit("1234"), lit("hex")),
array_to_string(array(vec![lit(1), lit(2), lit(3)]), lit(",")),
array_to_string(make_array(vec![lit(1), lit(2), lit(3)]), lit(",")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -44,6 +44,7 @@ async-trait = { workspace = true }
chrono = { workspace = true }
datafusion-common = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-array = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this new dependency -- it means that using the optimizer will require bringing in the physical exprs, etc and users (like dask-sql) that only want the planner code would bring in substantial unused code.

I realize the reason this is required is to preserve the semantics of the existing rewrite pass in datafusion/optimizer/src/analyzer/rewrite_expr.rs. I wonder if we can somehow avoid adding this new dependency 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can split udf to logical expr and physcial expr, and export only logical expr conditionally (with #[cfg(feature = "array_expressions")]) to optimizer, so we can do optimizing for udfs, but only import the necessary one for user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid importing function-arrays if we get udf via self.context_provider.get_function_meta(&name) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid importing function-arrays if we get udf via self.context_provider.get_function_meta(&name) ?

wow, This may be feasible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jayzhan211 , Do you think my current approach to the analyzer is feasible? I move array analyzer to datafusion-functions-array crate, add array analyzer to analyzer rules if array_expression featrue is enable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a good idea, it seems like duplicated code without any good reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guojidan
let's see if it is possible to optionally import datafusion-functions-array, it seems an easier way than bringing udfs to optimizer.

@github-actions github-actions bot added the core Core DataFusion crate label Feb 29, 2024
@guojidan
Copy link
Contributor Author

guojidan commented Feb 29, 2024

hi @jayzhan211 , if make_array is conditionally export, how can I deal this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133, thank you

@jayzhan211
Copy link
Contributor

hi @jayzhan211 , if make_array is conditionally export, how can I deal this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133, thank you

I think it depends on whether we should place make_array into additional array category or core function category.
if we consider make_array to be supported only if cfg array_expression is set, than we also support array literal if the flag is set, otherwise we support it by default.

@guojidan
Copy link
Contributor Author

hi @jayzhan211 , if make_array is conditionally export, how can I deal this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133, thank you

I think it depends on whether we should place make_array into additional array category or core function category. if we consider make_array to be supported only if cfg array_expression is set, than we also support array literal if the flag is set, otherwise we support it by default.

yes, I think make_array should be a core function

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 29, 2024

hi @jayzhan211 , if make_array is conditionally export, how can I deal this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133, thank you

I think it depends on whether we should place make_array into additional array category or core function category. if we consider make_array to be supported only if cfg array_expression is set, than we also support array literal if the flag is set, otherwise we support it by default.

yes, I think make_array should be a core function

#9100 (comment)
Edit: Let's keep it in function-arrays first, unless there is any good reason

@guojidan
Copy link
Contributor Author

hi @jayzhan211 , if make_array is conditionally export, how can I deal this function: https://github.com/apache/arrow-datafusion/blob/main/datafusion/sql/src/expr/value.rs#L133, thank you

I think it depends on whether we should place make_array into additional array category or core function category. if we consider make_array to be supported only if cfg array_expression is set, than we also support array literal if the flag is set, otherwise we support it by default.

yes, I think make_array should be a core function

#9100 (comment) Edit: Let's keep it in function-arrays first, unless there is any good reason

as you said, we can use self.context_provider.get_function_meta("make_array") replace

@jayzhan211 jayzhan211 marked this pull request as draft March 2, 2024 05:21
@jayzhan211
Copy link
Contributor

I converted it to the draft, feel free to ping me when the PR is ready to review

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 2, 2024

#9100 (comment) Edit: Let's keep it in function-arrays first, unless there is any good reason

as you said, we can use self.context_provider.get_function_meta("make_array") replace

Remember to replace them in rewriter too

@guojidan
Copy link
Contributor Author

guojidan commented Mar 6, 2024

hi @jayzhan211 , sorry long time to reply, this pr's todo list is:

  1. don't move rewriter_expr.rs file into datafusion-functions-array crate
  2. use self.context_provider.get_function_meta("make_array") replace make_array() function and so on in rewriter_expr.rs
    Is there anything else to add?

@jayzhan211
Copy link
Contributor

hi @jayzhan211 , sorry long time to reply, this pr's todo list is:

  1. don't move rewriter_expr.rs file into datafusion-functions-array crate
  2. use self.context_provider.get_function_meta("make_array") replace make_array() function and so on in rewriter_expr.rs
    Is there anything else to add?

Nope

@guojidan
Copy link
Contributor Author

guojidan commented Mar 7, 2024

hi @jayzhan211 , sorry long time to reply, this pr's todo list is:

  1. don't move rewriter_expr.rs file into datafusion-functions-array crate
  2. use self.context_provider.get_function_meta("make_array") replace make_array() function and so on in rewriter_expr.rs
    Is there anything else to add?

difficult to implement, Analyzer struct have not dyn ContextProvider member, and I don't known how add dyn ContextProvider into Analyzer, can you give me some tips? @jayzhan211

@guojidan
Copy link
Contributor Author

guojidan commented Mar 8, 2024

because this pr lasting for too long,difficult to rebase or merge, so I open a new pr #9504 , I will close this pr

@guojidan guojidan closed this Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

move make_array array_append array_prepend array_concat function to datafusion-functions-array crate
3 participants