Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: add rule to fold two grouping operators into one #49627

Merged
merged 1 commit into from
Jun 2, 2020

Conversation

DrewKimball
Copy link
Collaborator

Previously, the optimizer could not fold two grouping operators into
a single equivalent grouping operator.

This patch adds a rule that effects this transformation under the
following conditions:

  1. All of the outer aggregates are aggregating on the output columns
    of the inner aggregates.
  2. Every inner-outer aggregation pair can be replaced with an
    equivalent single aggregate.
  3. The inner grouping columns functionally determine the outer
    grouping columns.
  4. Both grouping operators are unordered.

As an example, the following query pairs are equivalent:

SELECT sum(t) FROM (SELECT sum(b) FROM ab GROUP BY a) AS g(t);
SELECT sum(b) FROM ab;

SELECT max(t) FROM (SELECT max(b) FROM ab GROUP BY a) AS g(t);
SELECT max(b) FROM ab;

SELECT sum_int(t) FROM (SELECT count(b) FROM ab GROUP BY a) AS g(t);
SELECT count(b) FROM ab;

This situation is rare in direct SQL queries, but can arise when
composing views and queries.

Release note (sql change): The optimizer can now fold two grouping
operators together when they are aggregating over functions like sum.

@DrewKimball DrewKimball requested a review from a team as a code owner May 28, 2020 03:50
@blathers-crl
Copy link

blathers-crl bot commented May 28, 2020

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the O-community Originated from the community label May 28, 2020
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I left a few questions for my own understanding. :lgtm:

It's outside the scope of this PR, but I think there's another case that can be folded: when there is an outer aggregate function (like max or min) on the grouped inner column. For example:

SELECT max(a), sum(b) FROM (SELECT x, sum(y) FROM r GROUP BY x) AS s(a, b)
=>
SELECT max(x), sum(y) FROM r

Not sure if this case is worth handling in the future, but thought I'd point it out.

Reviewed 1 of 7 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/operator.go, line 374 at r1 (raw file):

	case CountOp, CountRowsOp:
		return outer == SumIntOp

So you're only checking for SumIntOp here because outer could never be a SumOp because CountOp and CountRowsOp always result in integers. Is my understanding correct?


pkg/sql/opt/norm/groupby_funcs.go, line 321 at r1 (raw file):

		innerAgg := innerAggs[i].Agg
		if !opt.IsAggregateOp(innerAgg) {
			return false

When would an innerAgg or outerAgg be something other than an AggregateOp?

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/operator.go, line 374 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

So you're only checking for SumIntOp here because outer could never be a SumOp because CountOp and CountRowsOp always result in integers. Is my understanding correct?

Yes, exactly.


pkg/sql/opt/norm/groupby_funcs.go, line 321 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

When would an innerAgg or outerAgg be something other than an AggregateOp?

They could be AggFilters or AggDistincts. These could be allowed in the future but I wanted to keep things simple for this PR.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. I wonder what the motivation for this was, did you encounter a case like this?

Also, I want to commend you on your use of "effects" haha

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @mgartner)


pkg/sql/opt/norm/groupby_funcs.go, line 321 at r1 (raw file):

Previously, DrewKimball (Andrew Kimball) wrote…

They could be AggFilters or AggDistincts. These could be allowed in the future but I wanted to keep things simple for this PR.

It's worth adding a comment, mentioning AggFilters/AggDistinct (or checking for those particular ops) to make things more clear.


pkg/sql/opt/norm/rules/groupby.opt, line 378 at r1 (raw file):

# ignore orderings).
[FoldGroupingOperators, Normalize]
(GroupBy | ScalarGroupBy | DistinctOn

We document FirstAgg as only being used in DistinctOn. I don't think we can execbuild an expression that has a FirstAgg inside a GroupBy. Conversely, DistinctOn doesn't support anything other than ConstAgg and FirstAgg. I think we should remove FirstAgg and DistinctOn altogether from this rule.

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/norm/groupby_funcs.go, line 321 at r1 (raw file):

Previously, RaduBerinde wrote…

It's worth adding a comment, mentioning AggFilters/AggDistinct (or checking for those particular ops) to make things more clear.

Done.


pkg/sql/opt/norm/rules/groupby.opt, line 378 at r1 (raw file):

Previously, RaduBerinde wrote…

We document FirstAgg as only being used in DistinctOn. I don't think we can execbuild an expression that has a FirstAgg inside a GroupBy. Conversely, DistinctOn doesn't support anything other than ConstAgg and FirstAgg. I think we should remove FirstAgg and DistinctOn altogether from this rule.

Done.

@blathers-crl
Copy link

blathers-crl bot commented May 28, 2020

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@DrewKimball
Copy link
Collaborator Author

DrewKimball commented May 28, 2020

Nice work. I wonder what the motivation for this was, did you encounter a case like this?

Once the joins are removed from the query at line 593 in xform/testdata/external/trading, we end up with a ScalarGroupBy on top of a GroupBy.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@DrewKimball
Copy link
Collaborator Author

It's outside the scope of this PR, but I think there's another case that can be folded: when there is an outer aggregate function (like max or min) on the grouped inner column. For example:

SELECT max(a), sum(b) FROM (SELECT x, sum(y) FROM r GROUP BY x) AS s(a, b)
=>
SELECT max(x), sum(y) FROM r

Not sure if this case is worth handling in the future, but thought I'd point it out.

That's a good idea, I hadn't thought of that. Now that you mention it, I think you could also fold any outer aggregate on the grouped inner column if it has an AggDistinct.

@mgartner
Copy link
Collaborator

That's a good idea, I hadn't thought of that. Now that you mention it, I think you could also fold any outer aggregate on the grouped inner column if it has an AggDistinct.

I think that makes sense. Do you mean something like this:

SELECT count(DISTINCT a), sum(b) FROM (SELECT x, sum(y) FROM xy GROUP BY x) AS ab(a, b)
=>
SELECT count(DISTINCT x), sum(y) FROM xy

@DrewKimball
Copy link
Collaborator Author

I think that makes sense. Do you mean something like this:

SELECT count(DISTINCT a), sum(b) FROM (SELECT x, sum(y) FROM xy GROUP BY x) AS ab(a, b)
=>
SELECT count(DISTINCT x), sum(y) FROM xy

Yes, exactly. Although it's possible that there's some rule that would remove the AggDistinct.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Just a few nits/questions.

Reviewed 3 of 7 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/opt/operator.go, line 358 at r2 (raw file):

//
//   SELECT sum(s) FROM (SELECT sum(y) FROM xy GROUP BY x) AS f(s);
//   SELECT sum(y) FROM xyz;

[nit] xyz -> xy (here and below)


pkg/sql/opt/operator.go, line 374 at r2 (raw file):

	case CountOp, CountRowsOp:
		return outer == SumIntOp

Is there a reason it can't also be SumOp?


pkg/sql/opt/norm/groupby_funcs.go, line 357 at r2 (raw file):

// MergeAggs returns an AggregationsExpr that is equivalent to the two given
// AggregationsExprs.

[nit] add a note that this will panic if CanMergeAggs is false.


pkg/sql/opt/norm/groupby_funcs.go, line 370 at r2 (raw file):

		// For each outer aggregate, construct a new aggregate that takes the Agg
		// field of the referenced inner aggregate and the Col field of the outer
		// aggregate.

[nit] can you explain why you need the Agg from the inner and Col from the input to the outer?


pkg/sql/opt/norm/rules/groupby.opt, line 357 at r2 (raw file):

#   6 6 5
#
# Its functional dependencies: key(r), r->(a, b, c), a->(b, c)

c is not included in the table -- these FDs seem wrong


pkg/sql/opt/norm/testdata/rules/groupby, line 3023 at r2 (raw file):

# Case with a GroupBy on a GroupBy.
norm expect=FoldGroupingOperators
SELECT sum_int(s) FROM (SELECT y, sum_int(x) AS s FROM xy GROUP BY y) GROUP BY y

Do you have any tests where the inner and outer group by columns are different but the inner still determines the outer? How about a test with more than one group by column?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/operator.go, line 358 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] xyz -> xy (here and below)

Done.


pkg/sql/opt/operator.go, line 374 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is there a reason it can't also be SumOp?

CountOp and CountRowsOp always produce ints, while (I believe) SumOp produces a decimal number. So, replacing a SumOp on a CountOp would change the outputted type from decimal to int. I'll add a comment.


pkg/sql/opt/norm/groupby_funcs.go, line 357 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] add a note that this will panic if CanMergeAggs is false.

Done.


pkg/sql/opt/norm/groupby_funcs.go, line 370 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] can you explain why you need the Agg from the inner and Col from the input to the outer?

Done.


pkg/sql/opt/norm/rules/groupby.opt, line 357 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

c is not included in the table -- these FDs seem wrong

Right, I forgot to prune those after removing the c column from the table. Done.


pkg/sql/opt/norm/testdata/rules/groupby, line 3023 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Do you have any tests where the inner and outer group by columns are different but the inner still determines the outer? How about a test with more than one group by column?

Done.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)


pkg/sql/opt/operator.go, line 374 at r2 (raw file):

Previously, DrewKimball (Andrew Kimball) wrote…

CountOp and CountRowsOp always produce ints, while (I believe) SumOp produces a decimal number. So, replacing a SumOp on a CountOp would change the outputted type from decimal to int. I'll add a comment.

Doh. I left the same question yesterday... I should have suggested a comment then. This is definitely comment-worthy, so thanks for adding it!

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/opt/norm/groupby_funcs.go, line 373 at r3 (raw file):

		// every inner-outer aggregate pair forms a valid decomposition for the
		// inner aggregate. The column from the outer aggregate has to be used to
		// preserve logical equivalency.

It would also help to add a reminder that you use the inner Agg field since both inner and outer are the same in most cases, but for count/count rows you need the inner. You can also add a reference to see opt.AggregatesCanMerge for details.

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 1, 2020

CLA assistant check
All committers have signed the CLA.

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)


pkg/sql/opt/norm/groupby_funcs.go, line 373 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It would also help to add a reminder that you use the inner Agg field since both inner and outer are the same in most cases, but for count/count rows you need the inner. You can also add a reference to see opt.AggregatesCanMerge for details.

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball)


pkg/sql/opt/norm/groupby_funcs.go, line 374 at r4 (raw file):

		// inner aggregate. In most cases, the inner and outer aggregates are the
		// same, but in the count and count-rows cases the inner aggregate must
		// be used (see opt.aggregatesCanMerge for details). The column from the

[nit] opt.aggregatesCanMerge -> opt.AggregatesCanMerge

Previously, the optimizer could not fold two grouping operators into
a single equivalent grouping operator.

This patch adds a rule that effects this transformation under the
following conditions:
1. All of the outer aggregates are aggregating on the output columns
   of the inner aggregates.
2. Every inner-outer aggregation pair can be replaced with an
   equivalent single aggregate.
3. The inner grouping columns functionally determine the outer
   grouping columns.
4. Both grouping operators are unordered.

As an example, the following query pairs are equivalent:
```
SELECT sum(t) FROM (SELECT sum(b) FROM ab GROUP BY a) AS g(t);
SELECT sum(b) FROM ab;

SELECT max(t) FROM (SELECT max(b) FROM ab GROUP BY a) AS g(t);
SELECT max(b) FROM ab;

SELECT sum_int(t) FROM (SELECT count(b) FROM ab GROUP BY a) AS g(t);
SELECT count(b) FROM ab;
```
This situation is rare in direct SQL queries, but can arise when
composing views and queries.

Release note (sql change): The optimizer can now fold two grouping
operators together when they are aggregating over functions like sum.
Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)


pkg/sql/opt/norm/groupby_funcs.go, line 374 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] opt.aggregatesCanMerge -> opt.AggregatesCanMerge

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

@DrewKimball
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 2, 2020

Build succeeded

@craig craig bot merged commit d7a0a90 into cockroachdb:master Jun 2, 2020
@DrewKimball DrewKimball deleted the fold_groupby branch June 18, 2020 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants