-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix(substrait): Do not add implicit groupBy expressions in LogicalPlanBuilder
or when building logical plans from Substrait
#14860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @anlinc It looks like there is a small clippy failure with this PR: https://github.com/apache/datafusion/actions/runs/13507870042/job/37742937287?pr=14860 |
pub struct LogicalPlanBuilderOptions { | ||
/// Flag indicating whether the plan builder should add | ||
/// functionally dependent expressions as additional aggregation groupings. | ||
add_implicit_group_by_exprs: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the default now false
? That's a breaking change I guess? Is that fine, or should we make the default be true
to maintain earlier behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the default is now false
.
My intuition was that this made the most sense as I'd prefer a composable base builder that suits all needs, that allows opt-ins to advanced features and optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although admittedly, I was not aware that this was going to be a breaking change, since SQL and Dataframes maintain the same behavior.
@vbarua informed me that it may be possible that there are those depending on the builder directly. In which case, I can see an argument for maintaining the default true
behavior...
I'm a new contributor to this project so I'm lacking context. I might lean on you folks to make a more informed decision here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lean towards having the default behaviour be false
for this, even if it's a breaking change, because it makes the builder less surprising IMO. Specifically, when invoking the builder for an aggregate with a specific set of grouping expressions, my expectation is that it should produce an aggregate with those specific grouping expressions. If I wanted additional grouping expressions, I would have included them.
There's definitely room and value for optimizations like what is going on here, but I think those need to be opt-in to avoid situation like this were the plan builder tries to be smart along one specific axis and inadvertently shoots you in the foot in another. In the past, I think we've leaned towards having the builder be as straightforward as possible and then handling optimizations in the optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be less surprising, but I think we should consider it in a follow on PR. This PR fixes the issue with substrait and is a good change on its own. I have filed a follow on ticket to track the idea:
Update: I see this PR changes the defaults. I updated the title of the PR and will mark it as an API change
#[tokio::test] | ||
async fn multilayer_aggregate() -> Result<()> { | ||
assert_expected_plan( | ||
"SELECT a, sum(partial_count_b) FROM (SELECT a, count(b) as partial_count_b FROM data GROUP BY a) GROUP BY a", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a bit confused by this. Why does the original plan created from SQL already contain the implicit groupBy's here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test goes SQL -> Logical Plan -> Substrait Plan -> Logical Plan.
There were no additional groupBys added between SQL -> Logical Plan -> Substrait Plan
.
(Prior to this PR) Additional groupBys were added between Substrait Plan -> Logical Plan
, violating the round trip plan expectations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @anlinc , great to get this fixed and the approach here looks good! Just a note on the default in LogicalPlanBuilderOptions - should it be "true" to not break potential other users?
And a question on one of the tests, just for my understanding 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me overall. Thanks for working on this @anlinc
LogicalPlanBuilder
or when building logical plans from Substrait
…anBuilder` or when building logical plans from Substrait (apache#14860) * feat: add add_implicit_group_by_exprs option to logical plan builder * fix: do not add implicity group by exprs in substrait path * test: add substrait tests * test: add builder option tests * style: clippy errors
Which issue does this PR close?
Closes #14348
Rationale for this change
Substrait plans are intended to be interpreted literally. When you see plan nodes like:
The output mapping (e.g.
[0, 3]
) contains ordinals representing the offset of the target expression(s) within the [input, output] list. If the DataFusion LogicalPlanBuilder is introducing additional input expressions, this violates the plan's intent and will produce the incorrect output mappings. Please see the issue for a concrete example.What changes are included in this PR?
add_implicit_group_by_exprs
option to the logical plan builder. It is disabled by default.Are these changes tested?
Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.
There should be no changes in the non-Substrait paths.
Are there any user-facing changes?
Yes. LogicalPlanBuilder will not longer add implicity group by expressions by default.
SQL, DataFrame paths remain untouched.
Appendix
This is a continuation of #14553.