Question / request for Aggregate operation #42

mbasmanova · 2021-10-01T10:51:45Z

We'd like to be able to specify masks for individual aggregations and a boolean ignoreNullKeys for a grouping set.

Masks are boolean input columns that mask out rows for individual aggregations, e.g. SELECT count(1) filter (where a > 10) FROM t.

The ignoreNullKeys boolean flag avoids unnecessary processing when an aggregation is followed by an inner join on the grouping keys. In this case, rows with nulls in the grouping keys cannot possibly match the join condition, so we'd like to skip aggregation for such groups.
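To make the two requests concrete, here is a minimal sketch of a count aggregation that takes a per-aggregate mask column and an ignore-null-keys flag. The function name `aggregate`, its parameters, and the `a_gt_10` mask column are hypothetical and only illustrate the idea; they are not part of any existing API.

```python
from collections import defaultdict


def aggregate(rows, keys, aggregates, ignore_null_keys=False):
    """Group `rows` (dicts) by `keys` and compute one count per aggregate.

    `aggregates` maps an output name to the name of a boolean mask column,
    or to None when the aggregate has no mask (hypothetical layout).
    """
    groups = defaultdict(lambda: {name: 0 for name in aggregates})
    for row in rows:
        key = tuple(row[k] for k in keys)
        # Skip groups that could never match a later inner join on the keys.
        if ignore_null_keys and any(v is None for v in key):
            continue
        acc = groups[key]
        for name, mask in aggregates.items():
            # A row contributes only when there is no mask or the mask is true.
            if mask is None or row[mask] is True:
                acc[name] += 1
    return dict(groups)


rows = [
    {"k": "x", "a": 5, "a_gt_10": False},
    {"k": "x", "a": 12, "a_gt_10": True},
    {"k": None, "a": 20, "a_gt_10": True},
]
aggs = {"cnt": None, "cnt_filtered": "a_gt_10"}

with_nulls = aggregate(rows, ["k"], aggs)
without_nulls = aggregate(rows, ["k"], aggs, ignore_null_keys=True)
```

In this sketch the mask column is assumed to be computed upstream (for example by a projection of `a > 10`), which matches the description of masks as input columns.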

CC: @jacques-n

mbasmanova · 2021-10-01T10:55:35Z

Also, would it make sense to add a "step" enum to aggregate operation: partial, final, intermediate or single? This would allow us to specify the type of input and result of the operation, e.g.

(from https://facebookincubator.github.io/velox/develop/aggregate-functions.html)
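As a rough illustration of what a step enum could convey, the sketch below computes an average in stages. It assumes that partial and single steps consume raw input, that intermediate and final steps consume intermediate state, and that the intermediate state for an average is a (sum, count) pair. The `Step` enum and `avg` function are hypothetical names, not an existing API.

```python
from enum import Enum


class Step(Enum):
    PARTIAL = "partial"            # raw input -> intermediate state
    INTERMEDIATE = "intermediate"  # intermediate state -> intermediate state
    FINAL = "final"                # intermediate state -> final result
    SINGLE = "single"              # raw input -> final result


def avg(step, inputs):
    """Average over `inputs`, whose shape depends on `step`."""
    if step in (Step.PARTIAL, Step.SINGLE):
        # Raw input: a list of numbers.
        total, count = sum(inputs), len(inputs)
    else:
        # Intermediate input: a list of (sum, count) pairs to merge.
        total = sum(s for s, _ in inputs)
        count = sum(c for _, c in inputs)
    if step in (Step.FINAL, Step.SINGLE):
        return total / count if count else None
    return (total, count)


p1 = avg(Step.PARTIAL, [1, 2, 3])
p2 = avg(Step.PARTIAL, [4, 5])
merged = avg(Step.INTERMEDIATE, [p1, p2])
final = avg(Step.FINAL, [merged])
single = avg(Step.SINGLE, [1, 2, 3, 4, 5])
```

The staged result and the single-step result agree, which is the property a consumer of the plan would rely on when choosing where to split the aggregation.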

jacques-n · 2021-10-02T00:25:03Z

Totally agree with exposing the concept of a phase. I've already outlined it in the context of the aggregate functions but need to do the same for bindings within aggregate. I'll review the phases you've shared here, evaluate how they map onto those currently outlined in our proposal for aggregate functions, and shortly post a patch that includes at least an initial sketch of the common aggregations. Thanks!

Masks are boolean input columns that mask out rows for individual aggregations, e.g. SELECT count(1) filter (where a > 10) FROM t.

If I understand the request correctly, this feels like a syntactic-sugar variation of COUNT(CASE WHEN a > 10 THEN 1 ELSE NULL END). Is that correct/fair? If that's the case, why the need for a customization like this?

Another random thought is that one could define an extension function CONDITIONAL_COUNT(1, a > 10) or similar to provide this functionality. I'm just trying to figure out the specifics of why this should be expressed separately. Also, are you looking to add this to a particular physical aggregation (e.g. hash aggregation) or to the logical aggregation operation?
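The relationship between a filtered count and a CASE-based count can be checked with a small script. A filtered count keeps rows where the predicate is true, so the CASE form has to return a non-null value for exactly those rows and NULL otherwise. This sketch uses Python's built-in sqlite3 module; the table `t` and its data are made up for the example, and the FILTER query runs only when the bundled SQLite is 3.30.0 or newer, the release that added FILTER on aggregate functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)", [(5,), (12,), (20,), (30,), (None,)]
)

# CASE form: non-null only where the predicate holds, so count() skips the rest.
case_count = conn.execute(
    "SELECT count(CASE WHEN a > 10 THEN 1 ELSE NULL END) FROM t"
).fetchone()[0]

# FILTER form, guarded because older SQLite builds do not support it.
filter_count = None
if sqlite3.sqlite_version_info >= (3, 30, 0):
    filter_count = conn.execute(
        "SELECT count(1) FILTER (WHERE a > 10) FROM t"
    ).fetchone()[0]

conn.close()
```

The rewrite works for count, but it has to be redone per aggregate function, which is one argument for carrying the mask as a property of the measure rather than folding it into the argument expression.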

ignoreNullKeys boolean flag
This also sounds more like something you'd want to apply to a specific physical aggregation (e.g. hash aggregation), is that correct? Is your request across all grouping sets, per grouping set, or per field per grouping set? For example, imagine I have the grouping sets [[city, state, country], [state]]. What variations do you want to express? If I get a null for state, I figure I would exclude the record for the second grouping set. Would I also exclude it for the first, when there are other grouping fields that are still populated?
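To illustrate the question, here is a small sketch of two possible per-grouping-set policies: skip a row when any key in the set is null, or skip it only when all keys in the set are null. Neither policy is claimed to be the intended semantics; the function `count_by_grouping_sets` and its `skip_when` parameter are hypothetical.

```python
from collections import Counter


def count_by_grouping_sets(rows, grouping_sets, skip_when=any):
    """Count rows per group for each grouping set.

    `skip_when` is `any` (skip if any key in the set is null) or
    `all` (skip only if every key in the set is null).
    """
    results = []
    for keys in grouping_sets:
        counts = Counter()
        for row in rows:
            values = tuple(row[k] for k in keys)
            if skip_when(v is None for v in values):
                continue
            counts[values] += 1
        results.append(dict(counts))
    return results


rows = [
    {"city": "Austin", "state": "TX", "country": "US"},
    {"city": "Paris", "state": None, "country": "FR"},
]
sets = [["city", "state", "country"], ["state"]]

strict = count_by_grouping_sets(rows, sets, skip_when=any)
lenient = count_by_grouping_sets(rows, sets, skip_when=all)
```

Under the `any` policy the row with a null state drops out of both grouping sets; under the `all` policy it survives in the first set because city and country are still populated.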

- Remove aggregate expressions type from generalized expressions. (only allow aggregate expressions as root expressions for aggregation)
- Update function mapping to support options
- Remove named structs from type unions (should only be used in special places as root, not in arbitrary hierarchy)
- Add project, join, fetch, aggregate, sort, set logical relational operations.
- Introduce key scalar and aggregate functions in functions yaml.
- Remove old extensions docs

Address substrait-io#42, substrait-io#43, substrait-io#44

jacques-n · 2021-10-03T16:10:13Z

Also, would it make sense to add a "step" enum to aggregate operation: partial, final, intermediate or single?

@mbasmanova, in my most recent PR I've also proposed the following AggregationPhases:

substrait/binary/expression.proto

Line 88 in 1ca0bfb

enum AggregationPhase {

mbasmanova · 2021-10-04T13:14:34Z

@jacques-n Thank you. I'll check these out.

Updates to ideally support majority of tpch queries

- Remove aggregate expressions type from generalized expressions. (only allow aggregate expressions as root expressions for aggregation)
- Update function mapping to support options
- Remove named structs from type unions (should only be used in special places as root, not in arbitrary hierarchy)
- Add project, join, fetch, aggregate, sort, set logical relational operations.
- Introduce key scalar and aggregate functions in functions yaml.
- Remove old extensions docs
- Add nullability handling and type parsing syntax.

Address #42, #43, #44

jacques-n · 2021-11-25T21:01:39Z

Hey there, we've added support in aggregate measures for a masking filter per measure as part of #88. I believe that covers the majority of this ticket. For now, we propose using either a separate pre-filter for ignoreNulls or adding an AdvancedExtension that exposes that property in AggregateRel.

Closing this issue. Feel free to reopen!

jacques-n mentioned this issue Nov 22, 2021

Clarify/solidify extensions #80

Merged

jacques-n closed this as completed Nov 25, 2021

rkondakov pushed a commit to rkondakov/substrait that referenced this issue Nov 21, 2023

Add --multistatement option to allow multiple sql statements (substra…

2dff962

…it-io#42) * Ignore trailing semicolon during sql parsing Co-authored-by: James Taylor <james@qack.io>
