exec: add distinct aggregation support #39242

jordanlewis · 2019-08-01T18:43:41Z

Distinct aggregates only operate over distinct values of the grouping columns.

Needed for TPC-H query 16.

	// Aggregation functions with distinct = true functions like you would
	// expect '<FUNC> DISTINCT' to operate, the default behavior would be
	// the '<FUNC> ALL' operation.
	Distinct bool `protobuf:"varint,2,opt,name=distinct" json:"distinct"`

The text was updated successfully, but these errors were encountered:

jordanlewis · 2020-04-01T03:24:57Z

I think this is relatively important, especially the count(distinct()) case that came up in TPC-H - it's also important for a crucial geospatial use case.

awoods187 · 2020-04-01T22:37:27Z

Is this something we should promote up in importance for 20.2 or is it covered in a work item already?

jordanlewis · 2020-04-02T01:07:30Z

I think that we'll actually be covered here by #46899, which will be easier and also touches the general case.

50721: colexec: add support for DISTINCT and FILTER hash aggregation r=yuzefovich a=yuzefovich This commit adds the support of DISTINCT and FILTERing hash aggregation. The approach is as follows: - to handle FILTER we run a selection operator on the input state - to handle DISTINCT we encode aggregation columns, one tuple at a time, and update the selection vector to include tuples we haven't yet seen - then we run the aggregation on the remaining selected tuples - and then restore the state with the original length and selection vector. Such handling of FILTER clause sounds good to me, but the handling of DISTINCT is somewhat unfortunate: we perform encoding one tuple at a time. Other approaches have been prototyped but showed worse performance: - using the vectorized hash table - the benefit of such approach is that we don't reduce ourselves to one tuple at a time (because we would be hashing the full columns at once), but the big disadvantage is that the full tuples are stored in the hash table (instead of an encoded representation) - using a single global map for a particular aggregate function that is shared among all aggregation groups - the benefit of such approach is that we only have a handful of map, but it turned out that such global map grows a lot bigger and has worse performance. Addresses: #39241. Addresses: #39242. Release note (sql change): Vectorized execution engine now natively supports DISTINCT and FILTERing hash aggregation. Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

jordanlewis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-vec SQL vectorized engine labels Aug 1, 2019

jordanlewis mentioned this issue Aug 15, 2019

exec: tracking issue for TPC-H queries #34497

Closed

37 tasks

yuzefovich self-assigned this Jun 6, 2020

yuzefovich mentioned this issue Jun 26, 2020

colexec: add support for DISTINCT and FILTER hash aggregation #50721

Merged

yuzefovich mentioned this issue Aug 20, 2020

colexec: add support for DISTINCT ordered aggregation #53145

Merged

craig bot closed this as completed in cab0b31 Aug 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exec: add distinct aggregation support #39242

exec: add distinct aggregation support #39242

jordanlewis commented Aug 1, 2019 •

edited by rafiss

Loading

jordanlewis commented Apr 1, 2020

awoods187 commented Apr 1, 2020

jordanlewis commented Apr 2, 2020

exec: add distinct aggregation support #39242

exec: add distinct aggregation support #39242

Comments

jordanlewis commented Aug 1, 2019 • edited by rafiss Loading

jordanlewis commented Apr 1, 2020

awoods187 commented Apr 1, 2020

jordanlewis commented Apr 2, 2020

jordanlewis commented Aug 1, 2019 •

edited by rafiss

Loading