Support custom struct field names with new scalar function named_struct #9743
select struct(a as field_a, b) from t;
+--------------------------------------------------+
| named_struct(Utf8("field_a"),t.a,Utf8("c1"),t.b) |
+--------------------------------------------------+
| {field_a: 1, c1: 2}                              |
| {field_a: 3, c1: 4}                              |
+--------------------------------------------------+
Could you please tell me why would this test change?🤔
Currently this syntax is not supported. This PR adds support for it by actually calling named_struct instead, which itself can't support this syntax because it would be ambiguous, e.g.: named_struct('name1', 1 as name2)
Essentially, struct is treated as an expression and is rewritten to a named_struct function call.
But perhaps it's a bit confusing? What do you think?
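A minimal sketch of the rewrite described above, using only the standard library. `StructArg` and `to_named_struct_args` are hypothetical names invented for illustration and are not DataFusion types; the sketch only shows how aliased arguments keep their alias while unaliased ones fall back to the positional cN name.

```rust
/// Hypothetical stand-in for one parsed `struct(...)` argument:
/// an expression with an optional alias (`expr AS name`).
#[derive(Debug, Clone, PartialEq)]
struct StructArg {
    alias: Option<String>,
    expr: String,
}

/// Flattens `struct(...)` arguments into the `name, value, name, value, ...`
/// list that a `named_struct` call takes. Unaliased arguments fall back to
/// the positional `cN` name, where N is the argument index.
fn to_named_struct_args(args: &[StructArg]) -> Vec<String> {
    let mut out = Vec::with_capacity(args.len() * 2);
    for (i, arg) in args.iter().enumerate() {
        let name = arg.alias.clone().unwrap_or_else(|| format!("c{}", i));
        // The field name is emitted as a string literal, the value as-is.
        out.push(format!("'{}'", name));
        out.push(arg.expr.clone());
    }
    out
}
```

Under these assumptions, `struct(a as field_a, b)` becomes the argument list `'field_a', a, 'c1', b`, which lines up with the column header in the test output quoted earlier in this thread.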
Got it, thanks! :)
I think it would be better to add a new test case to illustrate this new feature rather than changing the old one? Just my two cents, let's see what others would think!
Indeed -- please do add a new test
@@ -50,6 +50,12 @@ select struct(1, 3.14, 'e');
----
{c0: 1, c1: 3.14, c2: e}

# struct scalar function #1 with alias
query ?
select struct(1 as "name0", 3.14 as name1, 'e', true as 'name3');
this is very cool -- I didn't realize we supported this syntax
Haha, I thought it was a generic syntax supported basically anywhere; only after your comment did I realize it's specific to struct(...)
Err(datafusion_common::DataFusionError::Internal(
    "return_type called instead of return_type_from_exprs".into(),
))
Can we use internal_err! here and also include the name named_struct as part of the error message?
I think maybe we can just delete this method return_type. I don't think this will be called in any case 🤔
@yyy1000 ideally yes, but it's a required method. If I remember correctly, it's because of backwards compatibility, right?
@gstvg Ohhh, yes. Sorry for that, I forgot return_type didn't have a default implementation :)
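To illustrate the point about a required method that is never expected to run, here is a standard-library-only sketch. `ScalarFn`, `UdfError`, and `NamedStruct` are hypothetical stand-ins and not DataFusion's real trait or error types; the idea shown is simply returning an internal error that names the function rather than panicking.

```rust
use std::fmt;

/// Hypothetical error type standing in for an engine's "internal error" variant.
#[derive(Debug, PartialEq)]
enum UdfError {
    Internal(String),
}

impl fmt::Display for UdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UdfError::Internal(msg) => write!(f, "Internal error: {}", msg),
        }
    }
}

/// Hypothetical trait whose `return_type` has no default implementation,
/// so every implementor must provide it even if it is never called.
trait ScalarFn {
    fn name(&self) -> &str;
    fn return_type(&self, arg_types: &[&str]) -> Result<String, UdfError>;
}

struct NamedStruct;

impl ScalarFn for NamedStruct {
    fn name(&self) -> &str {
        "named_struct"
    }

    fn return_type(&self, _arg_types: &[&str]) -> Result<String, UdfError> {
        // Not expected to be reached. Returning an error (instead of
        // panicking) keeps the failure recoverable if a later refactoring
        // starts calling this method, and naming the function makes the
        // message easier to trace.
        Err(UdfError::Internal(format!(
            "{}: return_type called instead of return_type_from_exprs",
            self.name()
        )))
    }
}
```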
I came up with an idea
array_ref,
))
} else {
exec_err!("named_struct even arguments must be string literals")
Similarly to below, I think it would be valuable to explicitly describe here what was received instead, so a user who encounters this error can more easily find their mistake and correct it
--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
--MemoryExec: partitions=1, partition_sizes=[1]

statement ok
Can we please add a few more tests:
- Error case: named_struct with 0 arguments
- Error case: named_struct with an odd number of arguments
- Error case: named_struct with an even number of arguments
- A test where the arguments are both columns and arrays (example below)
Here is an example (I expect it to panic in this PR at first -- see #9775 for how to fix it)
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.014 seconds.
❯ select named_struct(x as 'col_x', 25 as scalar) from t;
Yes, also a basic test case like the document you added.
select named_struct('field_a', a, 'field_b', b) from t;
Edit: or maybe you can add the test cases to a new test file?
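The error cases requested in this thread can be sketched as a small validation routine. This is an illustration only: `Arg` and `field_names` are hypothetical, standard-library-only stand-ins rather than DataFusion code, and the messages are modeled on the error strings quoted in this review, including what was received and at which position.

```rust
/// Hypothetical stand-in for an argument passed to `named_struct`.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Utf8(String),
    Int(i64),
    Column(String),
}

/// Validates a `name, value, name, value, ...` argument list and returns
/// the field names, or an error describing what was received instead.
fn field_names(args: &[Arg]) -> Result<Vec<String>, String> {
    if args.is_empty() {
        return Err(
            "named_struct requires at least one pair of arguments, got 0 instead".to_string(),
        );
    }
    if args.len() % 2 != 0 {
        return Err(format!(
            "named_struct requires an even number of arguments, got {} instead",
            args.len()
        ));
    }
    let mut names = Vec::with_capacity(args.len() / 2);
    for (i, pair) in args.chunks_exact(2).enumerate() {
        match &pair[0] {
            Arg::Utf8(name) => names.push(name.clone()),
            other => {
                return Err(format!(
                    "named_struct even arguments must be string literals, got {:?} instead at position {}",
                    other,
                    i * 2
                ))
            }
        }
    }
    Ok(names)
}
```

Each branch corresponds to one of the suggested tests: zero arguments, an odd number of arguments, and a name position that does not hold a string literal.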
if let ColumnarValue::Scalar(ScalarValue::Utf8(Some(name))) = name {
    let array_ref = match value {
        ColumnarValue::Array(array) => array.clone(),
        ColumnarValue::Scalar(scalar) => scalar.to_array()?.clone(),
I think you need to turn this into the correct length to match any array arguments (I mention a test below that will fix the problem); otherwise the sizes of the arguments will be mismatched
I suggest using ColumnarValue::values_to_arrays if possible
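The length mismatch described here can be shown with a standard-library-only sketch. `Value` and `broadcast_to_arrays` are hypothetical names, not the DataFusion API; the sketch assumes the intended behavior is that a scalar is repeated to the row count of the array arguments, and that with no array arguments each scalar becomes a single-row array.

```rust
/// Hypothetical stand-in for a columnar value: a full column or a single scalar.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Converts every argument into an array of the same length, repeating
/// scalars so they match the row count of the array arguments.
fn broadcast_to_arrays(args: &[Value]) -> Result<Vec<Vec<i64>>, String> {
    // Find the row count from the array arguments; all arrays must agree.
    let mut num_rows: Option<usize> = None;
    for arg in args {
        if let Value::Array(a) = arg {
            match num_rows {
                Some(n) if n != a.len() => {
                    return Err(format!("array length mismatch: {} vs {}", n, a.len()))
                }
                _ => num_rows = Some(a.len()),
            }
        }
    }
    // With no array arguments, each scalar becomes a one-row array.
    let num_rows = num_rows.unwrap_or(1);
    Ok(args
        .iter()
        .map(|arg| match arg {
            Value::Array(a) => a.clone(),
            Value::Scalar(s) => vec![*s; num_rows],
        })
        .collect())
}
```

For a three-row column combined with the scalar 25, this yields two arrays of length three, which is the shape a struct array needs for its child columns.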
use std::any::Any;
use std::sync::Arc;

/// put values in a struct array.
Rather than adding a new function, I think we could instead change the existing struct UDF to support explicit named fields.
Perhaps we can do that as a follow on PR?
--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
Yeah, I also have the same concern as @alamb said in #9743 (comment). It seems that from here, the current struct UDF will never be called, because calling struct will be rewritten into the named_struct UDF 👀
We can also leave this and probably add it as a follow-up
I agree
Filed #9839 for follow up
It looks nice to me overall, and I also added some comments which may help you improve the PR :) @gstvg
Thank you all for the comments and the review. Before applying the requested changes, I want to say that I don't think I elaborated enough on what this PR does and, most importantly, why it does it that way, and now I'm afraid this may not be the best way.
As @alamb noted and @yyy1000 agreed, ideally we would simply extend the current struct function to support named arguments, but because of the scalar function signature
We could modify the existing struct function to accept pairs of name and value, but it would be a breaking change, e.g., given an existing call
And most importantly, it would not support the much nicer syntax
So, we can create a new separate function that accepts pairs of name and value, but not named arguments. It would give us a minimum level of support without a nice syntax, and also naming it
Then, there's the improvement of supporting the
Currently, this PR, while parsing
Before opening this PR, I thought this:
I was wrapping the call with a cast while parsing
But if instead we parse
we are using it in the opposite way it was designed for. Is it okay to use it that way, updating the docs, or is it misuse?
So, given the options:
We can do:
What do you think?
Thanks for the explanation, @gstvg!
@gstvg Is it possible to have a syntax like
Do not allow partial named struct like
After the named struct
Yes, you're right, Edit: link @yyy1000
It would require a PR to sqlparser-rs
Fair enough
Yes, after minimal support is merged, and support for the syntax is added on sqlparser-rs, it would be trivial to support it, leaving us with a very nice support for structs! Edit: link @jayzhan211
Thank you @gstvg -- I think your solution 1.1 sounds like the best plan to me (and basically what this PR does). I think this PR is very important and will unlock several other great use cases for DataFusion, so thank you again for working on it
I agree that it would be nice to remove the old struct udf -- let's do that as a follow on PR (where we can discuss the merits / requirements for backwards compatibility) It looks like @jayzhan211 and you are already busy working on the sqlparser support for supporting the duckdb style literal syntax 🙏 Thus here is what I think we should do:
let name = match name_column {
    ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
    _ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)
❤️
--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
Filed #9839 for follow up
I think you can get the CI passing by merging this PR up from main. We are getting very close
Looks good to me, thank you a lot! @gstvg
Err(datafusion_common::DataFusionError::Internal(
    "return_type called instead of return_type_from_exprs".into(),
))
I can't quite figure out what code this was in relation to (the GitHub UI is being weird). But in general I am a fan of using internal errors if possible, even if they really are unreachable. The rationale is that if some bug gets introduced in a future refactoring, the failure mode is nicer than panicking
Thanks again @gstvg 🙏
And thanks for the reviews and suggestions @yyy1000 and @jayzhan211
…ct (apache#9743)
* Support custom struct field names with new scalar function named_struct
* add tests and correctly handle mixed array and scalar values
* fix slt
* fmt
* port test to slt
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #5861
What changes are included in this PR?
named_struct which accepts pairs of name and values, made possible with "Support compute return types from argument values (not just their DataTypes)" #8985
named_struct call, instead of the existing struct, and if any expression is a sqlparser::Expr::Named, use its name as the struct field name, otherwise fall back to the currently used cN convention.
Are these changes tested?
Yes, the new scalar function has a unit test similar to the existing struct function, and a sql logic test has been added too.
Are there any user-facing changes?
New scalar function.
Existing struct function is not changed.
Alternatives considered
struct call wrapped in an Expr::Cast to a struct with custom field names, but this would only benefit SQL users.