-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to support recursive unnest in physical plan #11577
Refactor to support recursive unnest in physical plan #11577
Conversation
…iple-projection-reference
@@ -165,6 +165,34 @@ select unnest(column1), column1 from unnest_table; | |||
6 [6] | |||
12 [12] | |||
|
|||
# unnest at different level at the same time | |||
query II | |||
select unnest([1,2,3]), unnest(unnest([[1,2,3]])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix #11689
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @duongcongtoai -- I started looking through this PR and it looks good so far -- I need to find a few more minutes to complete the review but I left some comments for now
} | ||
} | ||
|
||
#[derive(Debug, Clone, PartialEq, Eq, Hash)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add documentation to this struct about what it means (what computation it represents)? In the context of this PR it is clear, but once we merge I think the context will be lost
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, comment addressed
@@ -3049,17 +3051,54 @@ pub enum Partitioning { | |||
DistributeBy(Vec<Expr>), | |||
} | |||
|
|||
#[derive(Debug, Clone, PartialEq, Eq, Hash)] | |||
pub enum ColumnUnnestType { | |||
List(Vec<ColumnUnnestList>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add documentation to List
about what it is (ideally with an example of when it is used and how it is different to Inferred
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @duongcongtoai -- this is an epic PR. Very nice 👏
I think it is worth considering the public API (aka adding fields to UnnestOptions
rather than new methods), but otherwise this is looking really nice.
It is well tested and well commented
I left a few other comments that would be nice to address too -- comments and adding a few more tests.
Thanks again
cc @jayzhan211
/// if one column is a list type, it can be recursively and simultaneously | ||
/// unnested into the desired recursion levels | ||
/// e.g select unnest(list_col,depth=1), unnest(list_col,depth=2) | ||
pub fn unnest_columns_recursive_with_options( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternate API would be to add the ColumnUnnestType
as a field on UnnestOptions
(for that matter, maybe we should just add columns: Vec<(Column, ColumnUnnestType)>,
as a field 🤔
unnest_with_options(input, unnestings, UnnestOptions::default()) | ||
} | ||
|
||
pub fn get_unnested_list_datatype_recursive( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this needs to be pub
i think it should have some doc strings
} | ||
} | ||
|
||
pub fn get_struct_unnested_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here
preserve_nulls: true, | ||
}, | ||
)?; | ||
let actual = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the future, you an use assert_batches_eq
for this kind of comparison
assert_eq!( | ||
unnest_placeholder_columns, | ||
vec!["UNNEST(struct_col[matrix])"] | ||
// TODO: add a test case where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still something you mean to do?
@@ -165,6 +165,34 @@ select unnest(column1), column1 from unnest_table; | |||
6 [6] | |||
12 [12] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any .slt
tests for unnesting structures? these tests seem to be on arrays (Lists) rather than Structs
expr_in_unnest.clone().alias(placeholder_name.clone()), | ||
); | ||
|
||
// let post_unnest_column = Column::from_name(post_unnest_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like leftover
// Full context, we are trying to plan the execution as InnerProjection->Unnest->OuterProjection | ||
// inside unnest execution, each column inside the inner projection | ||
// will be transformed into new columns. Thus we need to keep track of these placeholding column names | ||
// let placeholder_name = unnest_expr.display_name()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like leftover
I think we can handle the follow ons / comments above as a follow on PR. Let's merge this one to keep things flowing. @duongcongtoai let me know if you plan to work on the suggestions. Thanks again. 🚀 onwards. 🙏 @jayzhan211 |
Yes thank you for the reviews, i'll create a follow up Pr |
* chore: poc * fix unnest struct * UT for memoization * remove unnessary projection * chore: temp test case * multi depth unnest supported * chore: add map of original column and transformed col * transformation map to physical layer * prototype for recursive array length * chore: some compile err * finalize input type in physical layer * chore: refactor unnest builder * add unnesting type inferred * fix compile err * fail test in builder * Compile err * chore: detect some bugs * some work * support recursive unnest in physical layer * UT for new build batch function * compile err * fix unnesting into empty arrays * some comment * fix unnest struct * some note * chore: fix all test failure * fix projection pushdown * custom rewriter for recursive unnest * simplify * rm unnecessary projection * chore: better comments * more comments * chore: better comments * remove breaking api * rename * more unit test * remove debug * clean up * fix proto * fix dataframe * fix clippy * cargo fmt * fix some test * fix all test * fix unnest in join * fix doc and tests * chore: better doc * better doc * tune comment * rm todo * refactor * chore: reserve test * add a basic test * chore: more document * doc on ColumnUnnestType List * chore: add partialord to new types
Which issue does this PR close?
Closes #11198 and #11689
Rationale for this change
We are not properly handling unnest where there are multiple unnest exprs at different level involved in the select exprs
What changes are included in this PR?
Implement
RecursiveUnnestRewriter
It can transform a composite expr of
unnest(unnest(unnest(columnA)))
into a logical planunnest(columnA, depth=3)
This implementation also fixed #11198, by the usage of
push_projection_dedupl
, meaning during the traversal of the complex expr, if a column is already projected, it should not be projected again (check #11198 (comment) for further context)Support recursive unnest at physical layer by calling
list_unnest_at_level
multiple timesTo fix #11689, i took a look how DuckDB and Postgres behave, where in the query there are multiple unnesting exprs, but with different recursion level.
This implementation follows DuckDB's behavior. For more context on the behavior difference, check this comment
This function is used to execute the unnesting on multiple columns all at once, but one level at a time, and is called n times, where n is the highest recursion level among the unnest exprs in the query.
For example giving the following query:
Then the total times this function being called is 3
It needs to be aware of which level the current unnesting is, because if there exists multiple unnesting on the same column, but with different recursion levels, say unnest(colA, max_depth:=3) and unnest(colA, max_depth:=2), then the unnesting of expr unnest(colA, max_depth:=3) will start at level 3, while unnesting for expr unnest(colA, max_depth:=2) has to start at level 2
Set colA as a 3-dimension columns and colB as an array (1-dimension). As stated, this function is called with the descending order of recursion depth
Depth = 3
Depth = 2
Depth = 1
Are these changes tested?
Are there any user-facing changes?