-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix(substrait): Do not add implicit groupBy expressions in LogicalPlanBuilder
or when building logical plans from Substrait
#14860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e3f3466
1a2ca1a
99a4c79
3687bc4
600d616
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -308,6 +308,17 @@ async fn aggregate_grouping_rollup() -> Result<()> { | |
).await | ||
} | ||
|
||
#[tokio::test] | ||
async fn multilayer_aggregate() -> Result<()> { | ||
assert_expected_plan( | ||
"SELECT a, sum(partial_count_b) FROM (SELECT a, count(b) as partial_count_b FROM data GROUP BY a) GROUP BY a", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm still a bit confused by this. Why does the original plan created from SQL already contain the implicit groupBy's here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test goes SQL -> Logical Plan -> Substrait Plan -> Logical Plan. There were no additional groupBys added between (Prior to this PR) Additional groupBys were added between |
||
"Aggregate: groupBy=[[data.a]], aggr=[[sum(count(data.b)) AS sum(partial_count_b)]]\ | ||
\n Aggregate: groupBy=[[data.a]], aggr=[[count(data.b)]]\ | ||
\n TableScan: data projection=[a, b]", | ||
true | ||
).await | ||
} | ||
|
||
#[tokio::test] | ||
async fn decimal_literal() -> Result<()> { | ||
roundtrip("SELECT * FROM data WHERE b > 2.5").await | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the default now
false
? That's a breaking change I guess? Is that fine, or should we make the default betrue
to maintain earlier behavior?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the default is now
false
.My intuition was that this made the most sense as I'd prefer a composable base builder that suits all needs, that allows opt-ins to advanced features and optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although admittedly, I was not aware that this was going to be a breaking change, since SQL and Dataframes maintain the same behavior.
@vbarua informed me that it may be possible that there are those depending on the builder directly. In which case, I can see an argument for maintaining the default
true
behavior...I'm a new contributor to this project so I'm lacking context. I might lean on you folks to make a more informed decision here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lean towards having the default behaviour be
false
for this, even if it's a breaking change, because it makes the builder less surprising IMO. Specifically, when invoking the builder for an aggregate with a specific set of grouping expressions, my expectation is that it should produce an aggregate with those specific grouping expressions. If I wanted additional grouping expressions, I would have included them.There's definitely room and value for optimizations like what is going on here, but I think those need to be opt-in to avoid situation like this were the plan builder tries to be smart along one specific axis and inadvertently shoots you in the foot in another. In the past, I think we've leaned towards having the builder be as straightforward as possible and then handling optimizations in the optimizer.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be less surprising, but I think we should consider it in a follow on PR. This PR fixes the issue with substrait and is a good change on its own. I have filed a follow on ticket to track the idea:Update: I see this PR changes the defaults. I updated the title of the PR and will mark it as an API change