Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move make_array, array_{intersect, union, distinct} to datafusion-functions-array` #9222

Closed
wants to merge 7 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Feb 13, 2024

Draft while I figure out:

  • Get tests passing
  • Review string rewrite / concat in the rewriter and figure out how/why it is necessary (doesn't simplify already do it??)

Which issue does this PR close?

Part of #8045
re #9100

Rationale for this change

Set the pattern for how more array functions can be moved from datafusion to datafusion-functions-array crates

What changes are included in this PR?

  1. Move make_array, array_intersect, array_union and array_distinct function to datafusion-functions-array
  2. Remove vestigal functions around

Are these changes tested?

Yes, by existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules labels Feb 13, 2024
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a start on moving array functions -- I still need to work on this a bit more. FYI @jayzhan211

@@ -229,121 +228,6 @@ fn check_datatypes(name: &str, args: &[&ArrayRef]) -> Result<()> {
Ok(())
}

/// Convert one or more [`ArrayRef`] of the same type into a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the code from here to the datafusion-function-array crate

@@ -564,7 +564,7 @@ enum ScalarFunction {
Sqrt = 17;
Tan = 18;
Trunc = 19;
Array = 20;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array is an alias for make_array it seems

);
}

fn align_array_dimensions<O: OffsetSizeTrait>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this over from array_expressions where it is used in one other place. I'll try and figure out how to remove it shortly

@@ -1,254 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this code to to_string.rs to align with keeping the ScalarFunction definitions inline with their implementations

use std::any::Any;

// Create static instances of ScalarUDFs for each function
make_udf_function!(ArrayToString,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to to_string.rs

}

// TODO figure out how to generalize this so it isn't hard coded by name
match (fn_name(left), fn_name(right)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why we have specific rewrites here for functions (that can't be done by constant folding or some other method). I don't like special casing the names

I need to look into it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Weijun-H do you happen to remember the rationale for this code?

BuiltinScalarFunction::MakeArray,
values,
)))
let make_array = self
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting -- now the SQL planner needs to replace the array literals with a scalar function, but it does the lookup by name. Rather than having this string hard coded maybe there should be a list of such functions somewhere 🤔

make_array,
array,
"returns an Arrow array using the specified input expressions.",
udf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be make_array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled to come up with a good name. The macro is basically creating two functions

/// fluent expr style
fn make_array(array: Expr) -> Expr {
}

/// return a ScalarUDF
fn udf() -> ScalarUdf { 
...
}

The udf one is used to register the function:

/// Registers all enabled packages with a [`FunctionRegistry`]
pub fn register_all(registry: &mut dyn FunctionRegistry) -> Result<()> {
    let functions: Vec<Arc<ScalarUDF>> = vec![
        to_string::udf(),
        make_array::udf(),
        set_ops::union_udf(),
        set_ops::intersect_udf(),
        set_ops::distinct_udf(),
    ];

So it can't be the expr_fn name , but udf is also pretty generic 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_udf() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe make_udf 🤔

This comment was marked as outdated.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should name it make_array_udf or something else including a function name since we might not be able to reuse udf or make_udf for other array functions. Prefixing it with array_function seems like a better choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants