Skip to content

Conversation

anlinc
Copy link
Contributor

@anlinc anlinc commented Feb 24, 2025

Which issue does this PR close?

Closes #14348

Rationale for this change

Substrait plans are intended to be interpreted literally. When you see plan nodes like:

"project": {
  "common": {
    "emit": {
      "outputMapping": [0, 3]
    }
  },
...
}

The output mapping (e.g. [0, 3]) contains ordinals representing the offset of the target expression(s) within the [input, output] list. If the DataFusion LogicalPlanBuilder is introducing additional input expressions, this violates the plan's intent and will produce the incorrect output mappings. Please see the issue for a concrete example.

What changes are included in this PR?

  • Introduce new add_implicit_group_by_exprs option to the logical plan builder. It is disabled by default.
    • The option is enabled only in the DataFrame and SQL paths.

Are these changes tested?

Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.

There should be no changes in the non-Substrait paths.

Are there any user-facing changes?

Yes. LogicalPlanBuilder will not longer add implicity group by expressions by default.

SQL, DataFrame paths remain untouched.

Appendix

This is a continuation of #14553.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate substrait Changes to the substrait crate labels Feb 24, 2025
@anlinc anlinc marked this pull request as ready for review February 24, 2025 21:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @anlinc

I think this looks great to me

cc @Blizzara @vbarua

@alamb alamb mentioned this pull request Feb 24, 2025
10 tasks
@alamb
Copy link
Contributor

alamb commented Feb 24, 2025

Hi @anlinc

It looks like there is a small clippy failure with this PR:

https://github.com/apache/datafusion/actions/runs/13507870042/job/37742937287?pr=14860

pub struct LogicalPlanBuilderOptions {
/// Flag indicating whether the plan builder should add
/// functionally dependent expressions as additional aggregation groupings.
add_implicit_group_by_exprs: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default now false? That's a breaking change I guess? Is that fine, or should we make the default be true to maintain earlier behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the default is now false.

My intuition was that this made the most sense as I'd prefer a composable base builder that suits all needs, that allows opt-ins to advanced features and optimizations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although admittedly, I was not aware that this was going to be a breaking change, since SQL and Dataframes maintain the same behavior.

@vbarua informed me that it may be possible that there are those depending on the builder directly. In which case, I can see an argument for maintaining the default true behavior...

I'm a new contributor to this project so I'm lacking context. I might lean on you folks to make a more informed decision here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lean towards having the default behaviour be false for this, even if it's a breaking change, because it makes the builder less surprising IMO. Specifically, when invoking the builder for an aggregate with a specific set of grouping expressions, my expectation is that it should produce an aggregate with those specific grouping expressions. If I wanted additional grouping expressions, I would have included them.

There's definitely room and value for optimizations like what is going on here, but I think those need to be opt-in to avoid situation like this were the plan builder tries to be smart along one specific axis and inadvertently shoots you in the foot in another. In the past, I think we've leaned towards having the builder be as straightforward as possible and then handling optimizations in the optimizer.

Copy link
Contributor

@alamb alamb Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be less surprising, but I think we should consider it in a follow on PR. This PR fixes the issue with substrait and is a good change on its own. I have filed a follow on ticket to track the idea:

Update: I see this PR changes the defaults. I updated the title of the PR and will mark it as an API change

#[tokio::test]
async fn multilayer_aggregate() -> Result<()> {
assert_expected_plan(
"SELECT a, sum(partial_count_b) FROM (SELECT a, count(b) as partial_count_b FROM data GROUP BY a) GROUP BY a",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a bit confused by this. Why does the original plan created from SQL already contain the implicit groupBy's here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test goes SQL -> Logical Plan -> Substrait Plan -> Logical Plan.

There were no additional groupBys added between SQL -> Logical Plan -> Substrait Plan.

(Prior to this PR) Additional groupBys were added between Substrait Plan -> Logical Plan, violating the round trip plan expectations.

Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anlinc , great to get this fixed and the approach here looks good! Just a note on the default in LogicalPlanBuilderOptions - should it be "true" to not break potential other users?

And a question on one of the tests, just for my understanding 😄

Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me overall. Thanks for working on this @anlinc

@alamb alamb changed the title fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait fix(substrait): Do not add implicit groupBy expressions in LogicalPlanBuilder or when building logical plans from Substrait Feb 25, 2025
@alamb alamb added the api change Changes the API exposed to users of the crate label Feb 25, 2025
@alamb
Copy link
Contributor

alamb commented Feb 25, 2025

Thanks @anlinc and @Blizzara and @vbarua for the reviews

@alamb alamb merged commit fc2f9dd into apache:main Feb 25, 2025
25 checks passed
anlinc added a commit to DataDog/datafusion that referenced this pull request Mar 7, 2025
…anBuilder` or when building logical plans from Substrait (apache#14860)

* feat: add add_implicit_group_by_exprs option to logical plan builder

* fix: do not add implicity group by exprs in substrait path

* test: add substrait tests

* test: add builder option tests

* style: clippy errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[substrait] Synthetically added grouping expressions in Aggregates can cause mismatched output columns
4 participants