Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable non-uniform field type for structs created in DataFusion #8463

Merged
merged 3 commits into from
Dec 11, 2023

Conversation

dlovell
Copy link
Contributor

@dlovell dlovell commented Dec 7, 2023

Which issue does this PR close?

Closes #8118

Rationale for this change

Datafusion can read in data with / have structs with fields of different types, but currently can't create a struct column with fields of different types: struct creation casts of all the struct fields to the same type.

What changes are included in this PR?

This PR defines TypeSignature::VariadicLimited and assigns it as the return type of BuiltinScalarFunction::Struct. TypeSignature::VariadicLimited is like a TypeSignature::VariadicAny with a limited definition of Any and is distinct from TypeSignature::Variadic in that it does not cast any fields.

Are these changes tested?

There is a test for a successful and an unsuccessful invocation of get_valid_types for TypeSignature::VariadicLimited

Are there any user-facing changes?

The type of the generated struct column will be different: users who are expecting a cast will be surprised.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 7, 2023
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me.

BTW, I think we also need a slt test for struct() function in the original issue

@@ -95,6 +95,8 @@ pub enum TypeSignature {
VariadicEqual,
/// One or more arguments with arbitrary types
VariadicAny,
/// arbitrary number of arguments out of a possibly different but limited set of valid types
VariadicLimited(Vec<DataType>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we have Variadic, VariadicLimited and VariadicEqual, VariadicAny. Their behaviors are similar, and I'm wondering how we can better reflect their difference in naming.

The tail Limited also applies to Variadic. So Variadic is in actual "variadic" + "equal" + "limited", and the new one is "variadic" + "any" + "limited". How about changing Variadic to VariadicLimitedEqual and naming the new one VariadicLimitedAny? But they are a bit verbose, is there any other options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had similar thoughts about the ambiguity of the names and went with VariadicLimited because I didn't want to change too much.

I am pro-verbosity: tab-completion makes verbosity less painful but there is no similar thing for ambiguity.

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

BTW, I think we also need a slt test for struct() function in the original issue

I wasn't sure where such a test would go or how it would look. Can you point me to an example?

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

To resolve these conflicts should I upmerge main? (I started from branch-33 because I was making the changes so I could use them with datafusion-python 33.0.0.)

Conflicting files
datafusion-cli/Cargo.lock
datafusion/expr/src/built_in_function.rs 

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

I believe this will also close the underlying issue reported in #7012

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @dlovell . Very much appreciated.

I think any change in DataFusion needs to have some end to end tests in sqllogictests -- see https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest -- perhaps you can add in the test from #8118

In this case I was thinking about it, and I don't see any reason we need to restrict the types of arguments to the struct function at all. Arrow imposes no restrictions of the types of fields in a StructArray so I don't see why DataFusion is imposing any restrictions either

When I took out all the checks for argument type like this:

diff --git a/datafusion/expr/src/built_in_function.rs b/datafusion/expr/src/built_in_function.rs
index d48e9e7a6..1a9822d76 100644
--- a/datafusion/expr/src/built_in_function.rs
+++ b/datafusion/expr/src/built_in_function.rs
@@ -959,8 +959,7 @@ impl BuiltinScalarFunction {
                 ],
                 self.volatility(),
             ),
-            BuiltinScalarFunction::Struct => Signature::variadic(
-                struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(),
+            BuiltinScalarFunction::Struct => Signature::variadic_any(
                 self.volatility(),
             ),
             BuiltinScalarFunction::Concat
diff --git a/datafusion/physical-expr/src/struct_expressions.rs b/datafusion/physical-expr/src/struct_expressions.rs
index 0eed1d16f..1d30702bb 100644
--- a/datafusion/physical-expr/src/struct_expressions.rs
+++ b/datafusion/physical-expr/src/struct_expressions.rs
@@ -34,31 +34,14 @@ fn array_struct(args: &[ArrayRef]) -> Result<ArrayRef> {
         .enumerate()
         .map(|(i, arg)| {
             let field_name = format!("c{i}");
-            match arg.data_type() {
-                DataType::Utf8
-                | DataType::LargeUtf8
-                | DataType::Boolean
-                | DataType::Float32
-                | DataType::Float64
-                | DataType::Int8
-                | DataType::Int16
-                | DataType::Int32
-                | DataType::Int64
-                | DataType::UInt8
-                | DataType::UInt16
-                | DataType::UInt32
-                | DataType::UInt64 => Ok((
+            Ok((
                     Arc::new(Field::new(
                         field_name.as_str(),
                         arg.data_type().clone(),
                         true,
                     )),
                     arg.clone(),
-                )),
-                data_type => {
-                    not_impl_err!("Struct is not implemented for type '{data_type:?}'.")
-                }
-            }
+                ))
         })
         .collect::<Result<Vec<_>>>()?;

The test case from @yukkit seems to create the field types unchanged (as verified using arrow_typeof):

DataFusion CLI v33.0.0
❯ CREATE TABLE values(
    c0 INT,
    c1 String,
    c2 String
) AS VALUES
  (1, 'a', 'a'),
  (2, 'b', 'b'),
  (3, 'c', 'c');
0 rows in set. Query took 0.015 seconds.

❯ select arrow_typeof(struct(c0, c1, c2)) from VALUES;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(struct(values.c0,values.c1,values.c2))                                                                                                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set. Query took 0.005 seconds.

Maybe @Ted-Jiang who seems to have added struct in #2389 has some contet about why the types were restricted

@@ -17,9 +17,9 @@
under the License.
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this changelog file was modified in this PR. Perhaps it was a mistake 🤔

It seems to be some part of a change added in #8144

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working from branch-33 so I could build datafusion-python 33.0.0 against the changes.

Should I rebase onto main?

Copy link
Contributor

@alamb alamb Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working from branch-33 so I could build datafusion-python 33.0.0 against the changes.

Should I rebase onto main?

Yes please (when we are ready to merge this)

@@ -890,7 +890,7 @@ impl BuiltinScalarFunction {
// 0 or more arguments of arbitrary type
Signature::one_of(vec![VariadicAny, Any(0)], self.volatility())
}
BuiltinScalarFunction::Struct => Signature::variadic(
BuiltinScalarFunction::Struct => Signature::variadic_limited(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using Signature::variadic_any? That I think would pass along the argument types directly (not try to coerce them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the simplest solution. If I make that change, I presume I should remove TypeSignature::VariadicLimited?

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

I think any change in DataFusion needs to have some end to end tests in sqllogictests -- see https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest -- perhaps you can add in the test from #8118

Ah, I wasn't aware of sqllogictest, looking into it.

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

@alamb Looking into it, its not clear to me how @yukkit 's test is different from struct scalar function with columns #1, which already exists but also passes.

@dlovell
Copy link
Contributor Author

dlovell commented Dec 8, 2023

@alamb I think I get it: I should test the explain output (which does change with the changes to struct), not the value that is written to the slt file by --complete (which does not change with the changes to struct)

@github-actions github-actions bot added physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Dec 8, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dlovell -- this looks great!

@@ -890,8 +890,7 @@ impl BuiltinScalarFunction {
// 0 or more arguments of arbitrary type
Signature::one_of(vec![VariadicAny, Any(0)], self.volatility())
}
BuiltinScalarFunction::Struct => Signature::variadic(
struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also remove SUPPORTED_STRUCT_TYPES entirely (I don't think it is used anywhere else)

We can do so as a follow on PR too.

@@ -58,5 +58,16 @@ select struct(a, b, c) from values;
{c0: 2, c1: 2.2, c2: b}
{c0: 3, c1: 3.3, c2: c}

# explain struct scalar function with columns #1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
I double checked and before this PR this query results in casts

❯ explain select struct(a, b, c) from values;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: struct(CAST(values.a AS Utf8), CAST(values.b AS Utf8), values.c)                                   |
|               |   TableScan: values projection=[a, b, c]                                                                       |
| physical_plan | ProjectionExec: expr=[struct(CAST(a@0 AS Utf8), CAST(b@1 AS Utf8), c@2) as struct(values.a,values.b,values.c)] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                |
|               |                                                                                                                |
+---------------+----------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.

@alamb
Copy link
Contributor

alamb commented Dec 9, 2023

@dlovell -- this PR has conflicts that need to be resolved before I can merge into main (basically we need to rebase this PR against main)

Is that something you would like to do?

@dlovell dlovell force-pushed the variadic-limited-33 branch from 2536cda to 4474b07 Compare December 9, 2023 12:05
@dlovell dlovell force-pushed the variadic-limited-33 branch from 4474b07 to 323ceb0 Compare December 9, 2023 12:07
@dlovell
Copy link
Contributor Author

dlovell commented Dec 9, 2023

@alamb I did the rebase, let me know if there's anything that needs to be changed.

@alamb
Copy link
Contributor

alamb commented Dec 9, 2023

Thanks again @dlovell -- BTW I love bug fixes that involve deleting code ❤️

@alamb
Copy link
Contributor

alamb commented Dec 9, 2023

🤔 looks like we need to run cargo fmt and cargo clippy to get a clean CI run

@yukkit
Copy link
Contributor

yukkit commented Dec 9, 2023

This looks great!

@dlovell
Copy link
Contributor Author

dlovell commented Dec 9, 2023

@alamb I ran cargo fmt and there was a change, but no change for cargo clippy

1 similar comment
@dlovell
Copy link
Contributor Author

dlovell commented Dec 9, 2023

@alamb I ran cargo fmt and there was a change, but no change for cargo clippy

@dlovell
Copy link
Contributor Author

dlovell commented Dec 10, 2023

@alamb apologies, just found out about dev/rust_lint.sh

@alamb
Copy link
Contributor

alamb commented Dec 10, 2023

No worries -- thanks for sticking with it @dlovell

@alamb alamb merged commit 93b21bd into apache:main Dec 11, 2023
22 checks passed
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
…he#8463)

* feat: struct: implement variadic_any solution, enable all struct field types

* fix: run cargo-fmt

* cln: remove unused imports
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 3, 2024
…he#8463)

* feat: struct: implement variadic_any solution, enable all struct field types

* fix: run cargo-fmt

* cln: remove unused imports
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 3, 2024
…he#8463)

* feat: struct: implement variadic_any solution, enable all struct field types

* fix: run cargo-fmt

* cln: remove unused imports
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Function 'struct' may change the data type of input parameters
4 participants