-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable non-uniform field type for structs created in DataFusion #8463
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me.
BTW, I think we also need a slt test for struct()
function in the original issue
datafusion/expr/src/signature.rs
Outdated
@@ -95,6 +95,8 @@ pub enum TypeSignature { | |||
VariadicEqual, | |||
/// One or more arguments with arbitrary types | |||
VariadicAny, | |||
/// arbitrary number of arguments out of a possibly different but limited set of valid types | |||
VariadicLimited(Vec<DataType>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have Variadic
, VariadicLimited
and VariadicEqual
, VariadicAny
. Their behaviors are similar, and I'm wondering how we can better reflect their difference in naming.
The tail Limited
also applies to Variadic
. So Variadic
is in actual "variadic" + "equal" + "limited", and the new one is "variadic" + "any" + "limited". How about changing Variadic
to VariadicLimitedEqual
and naming the new one VariadicLimitedAny
? But they are a bit verbose, is there any other options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had similar thoughts about the ambiguity of the names and went with VariadicLimited
because I didn't want to change too much.
I am pro-verbosity: tab-completion makes verbosity less painful but there is no similar thing for ambiguity.
I wasn't sure where such a test would go or how it would look. Can you point me to an example? |
To resolve these conflicts should I upmerge main? (I started from
|
I believe this will also close the underlying issue reported in #7012 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @dlovell . Very much appreciated.
I think any change in DataFusion needs to have some end to end tests in sqllogictests -- see https://github.com/apache/arrow-datafusion/tree/main/datafusion/sqllogictest -- perhaps you can add in the test from #8118
In this case I was thinking about it, and I don't see any reason we need to restrict the types of arguments to the struct
function at all. Arrow imposes no restrictions of the types of fields in a StructArray so I don't see why DataFusion is imposing any restrictions either
When I took out all the checks for argument type like this:
diff --git a/datafusion/expr/src/built_in_function.rs b/datafusion/expr/src/built_in_function.rs
index d48e9e7a6..1a9822d76 100644
--- a/datafusion/expr/src/built_in_function.rs
+++ b/datafusion/expr/src/built_in_function.rs
@@ -959,8 +959,7 @@ impl BuiltinScalarFunction {
],
self.volatility(),
),
- BuiltinScalarFunction::Struct => Signature::variadic(
- struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(),
+ BuiltinScalarFunction::Struct => Signature::variadic_any(
self.volatility(),
),
BuiltinScalarFunction::Concat
diff --git a/datafusion/physical-expr/src/struct_expressions.rs b/datafusion/physical-expr/src/struct_expressions.rs
index 0eed1d16f..1d30702bb 100644
--- a/datafusion/physical-expr/src/struct_expressions.rs
+++ b/datafusion/physical-expr/src/struct_expressions.rs
@@ -34,31 +34,14 @@ fn array_struct(args: &[ArrayRef]) -> Result<ArrayRef> {
.enumerate()
.map(|(i, arg)| {
let field_name = format!("c{i}");
- match arg.data_type() {
- DataType::Utf8
- | DataType::LargeUtf8
- | DataType::Boolean
- | DataType::Float32
- | DataType::Float64
- | DataType::Int8
- | DataType::Int16
- | DataType::Int32
- | DataType::Int64
- | DataType::UInt8
- | DataType::UInt16
- | DataType::UInt32
- | DataType::UInt64 => Ok((
+ Ok((
Arc::new(Field::new(
field_name.as_str(),
arg.data_type().clone(),
true,
)),
arg.clone(),
- )),
- data_type => {
- not_impl_err!("Struct is not implemented for type '{data_type:?}'.")
- }
- }
+ ))
})
.collect::<Result<Vec<_>>>()?;
The test case from @yukkit seems to create the field types unchanged (as verified using arrow_typeof
):
DataFusion CLI v33.0.0
❯ CREATE TABLE values(
c0 INT,
c1 String,
c2 String
) AS VALUES
(1, 'a', 'a'),
(2, 'b', 'b'),
(3, 'c', 'c');
0 rows in set. Query took 0.015 seconds.
❯ select arrow_typeof(struct(c0, c1, c2)) from VALUES;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| arrow_typeof(struct(values.c0,values.c1,values.c2)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
| Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set. Query took 0.005 seconds.
Maybe @Ted-Jiang who seems to have added struct
in #2389 has some contet about why the types were restricted
dev/changelog/33.0.0.md
Outdated
@@ -17,9 +17,9 @@ | |||
under the License. | |||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this changelog file was modified in this PR. Perhaps it was a mistake 🤔
It seems to be some part of a change added in #8144
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working from branch-33
so I could build datafusion-python
33.0.0
against the changes.
Should I rebase onto main
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working from
branch-33
so I could builddatafusion-python
33.0.0
against the changes.Should I rebase onto
main
?
Yes please (when we are ready to merge this)
@@ -890,7 +890,7 @@ impl BuiltinScalarFunction { | |||
// 0 or more arguments of arbitrary type | |||
Signature::one_of(vec![VariadicAny, Any(0)], self.volatility()) | |||
} | |||
BuiltinScalarFunction::Struct => Signature::variadic( | |||
BuiltinScalarFunction::Struct => Signature::variadic_limited( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about using Signature::variadic_any
? That I think would pass along the argument types directly (not try to coerce them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the simplest solution. If I make that change, I presume I should remove TypeSignature::VariadicLimited
?
Ah, I wasn't aware of |
@alamb Looking into it, its not clear to me how @yukkit 's test is different from struct scalar function with columns #1, which already exists but also passes. |
@alamb I think I get it: I should test the explain output (which does change with the changes to struct), not the value that is written to the slt file by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dlovell -- this looks great!
@@ -890,8 +890,7 @@ impl BuiltinScalarFunction { | |||
// 0 or more arguments of arbitrary type | |||
Signature::one_of(vec![VariadicAny, Any(0)], self.volatility()) | |||
} | |||
BuiltinScalarFunction::Struct => Signature::variadic( | |||
struct_expressions::SUPPORTED_STRUCT_TYPES.to_vec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can also remove SUPPORTED_STRUCT_TYPES
entirely (I don't think it is used anywhere else)
We can do so as a follow on PR too.
@@ -58,5 +58,16 @@ select struct(a, b, c) from values; | |||
{c0: 2, c1: 2.2, c2: b} | |||
{c0: 3, c1: 3.3, c2: c} | |||
|
|||
# explain struct scalar function with columns #1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I double checked and before this PR this query results in casts
❯ explain select struct(a, b, c) from values;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: struct(CAST(values.a AS Utf8), CAST(values.b AS Utf8), values.c) |
| | TableScan: values projection=[a, b, c] |
| physical_plan | ProjectionExec: expr=[struct(CAST(a@0 AS Utf8), CAST(b@1 AS Utf8), c@2) as struct(values.a,values.b,values.c)] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.009 seconds.
@dlovell -- this PR has conflicts that need to be resolved before I can merge into main (basically we need to rebase this PR against main) Is that something you would like to do? |
2536cda
to
4474b07
Compare
4474b07
to
323ceb0
Compare
@alamb I did the rebase, let me know if there's anything that needs to be changed. |
Thanks again @dlovell -- BTW I love bug fixes that involve deleting code ❤️ |
🤔 looks like we need to run |
This looks great! |
@alamb I ran |
1 similar comment
@alamb I ran |
@alamb apologies, just found out about |
No worries -- thanks for sticking with it @dlovell |
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
…he#8463) * feat: struct: implement variadic_any solution, enable all struct field types * fix: run cargo-fmt * cln: remove unused imports
Which issue does this PR close?
Closes #8118
Rationale for this change
Datafusion can read in data with / have structs with fields of different types, but currently can't create a struct column with fields of different types: struct creation casts of all the struct fields to the same type.
What changes are included in this PR?
This PR defines
TypeSignature::VariadicLimited
and assigns it as the return type ofBuiltinScalarFunction::Struct
.TypeSignature::VariadicLimited
is like aTypeSignature::VariadicAny
with a limited definition ofAny
and is distinct fromTypeSignature::Variadic
in that it does not cast any fields.Are these changes tested?
There is a test for a successful and an unsuccessful invocation of
get_valid_types
forTypeSignature::VariadicLimited
Are there any user-facing changes?
The type of the generated struct column will be different: users who are expecting a cast will be surprised.