Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exec: add distinct aggregation support #39242

Closed
jordanlewis opened this issue Aug 1, 2019 · 3 comments · Fixed by #53145
Closed

exec: add distinct aggregation support #39242

jordanlewis opened this issue Aug 1, 2019 · 3 comments · Fixed by #53145
Assignees
Labels
A-sql-vec SQL vectorized engine C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@jordanlewis
Copy link
Member

jordanlewis commented Aug 1, 2019

Distinct aggregates only operate over distinct values of the grouping columns.

Needed for TPC-H query 16.

	// Aggregation functions with distinct = true functions like you would
	// expect '<FUNC> DISTINCT' to operate, the default behavior would be
	// the '<FUNC> ALL' operation.
	Distinct bool `protobuf:"varint,2,opt,name=distinct" json:"distinct"`
@jordanlewis jordanlewis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-vec SQL vectorized engine labels Aug 1, 2019
@jordanlewis
Copy link
Member Author

I think this is relatively important, especially the count(distinct()) case that came up in TPC-H - it's also important for a crucial geospatial use case.

@awoods187
Copy link
Contributor

Is this something we should promote up in importance for 20.2 or is it covered in a work item already?

@jordanlewis
Copy link
Member Author

I think that we'll actually be covered here by #46899, which will be easier and also touches the general case.

@yuzefovich yuzefovich self-assigned this Jun 6, 2020
craig bot pushed a commit that referenced this issue Aug 20, 2020
50721: colexec: add support for DISTINCT and FILTER hash aggregation r=yuzefovich a=yuzefovich

This commit adds the support of DISTINCT and FILTERing hash aggregation.
The approach is as follows:
- to handle FILTER we run a selection operator on the input state
- to handle DISTINCT we encode aggregation columns, one tuple at a time,
and update the selection vector to include tuples we haven't yet seen
- then we run the aggregation on the remaining selected tuples
- and then restore the state with the original length and selection
vector.

Such handling of FILTER clause sounds good to me, but the handling of
DISTINCT is somewhat unfortunate: we perform encoding one tuple at
a time.
Other approaches have been prototyped but showed worse performance:
- using the vectorized hash table - the benefit of such approach is that we
don't reduce ourselves to one tuple at a time (because we would be hashing
the full columns at once), but the big disadvantage is that the full tuples
are stored in the hash table (instead of an encoded representation)
- using a single global map for a particular aggregate function that is
shared among all aggregation groups - the benefit of such approach is that
we only have a handful of map, but it turned out that such global map grows
a lot bigger and has worse performance.

Addresses: #39241.
Addresses: #39242.

Release note (sql change): Vectorized execution engine now natively
supports DISTINCT and FILTERing hash aggregation.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@craig craig bot closed this as completed in cab0b31 Aug 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-vec SQL vectorized engine C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants