Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question / request for Aggregate operation #42

Closed
mbasmanova opened this issue Oct 1, 2021 · 5 comments
Closed

Question / request for Aggregate operation #42

mbasmanova opened this issue Oct 1, 2021 · 5 comments

Comments

@mbasmanova
Copy link

We'd like to be able to specify masks for individual aggregations and a boolean ignoreNullKeys for a grouping set.

Masks are input columns of type boolean which allow to mask out rows for individual aggregations, e.g. SELECT count(1) filter (where a > 10) FROM t.

ignoreNullKeys boolean flag allows to avoid unnecessary processing when an aggregation is followed by an inner join on the grouping keys. In this case, rows with nulls in grouping keys cannot possible match the join condition and therefore we'd like to skip aggregations for such groups.

CC: @jacques-n

@mbasmanova
Copy link
Author

Also, would it make sense to add a "step" enum to aggregate operation: partial, final, intermediate or single? This would allow us to specify the type of input and result of the operation, e.g.

Screen Shot 2021-10-01 at 6 55 01 AM

(from https://facebookincubator.github.io/velox/develop/aggregate-functions.html)

@jacques-n
Copy link
Contributor

Totally agree for the concept of phase exposed. I've already outlined in the context of the aggregate functions but need to do the same with regards to bindings within aggregate. I'll review the phases you've shared here and evaluate how they encompass those currently outlined in our proposal for aggregate functions and post a patch that includes at least an initial sketch of the common aggregations shortly. Thanks!

Masks are input columns of type boolean which allow to mask out rows for individual aggregations, e.g. SELECT count(1) filter (where a > 10) FROM t.

If I understand the request correctly, this feels like syntactic-sugared variation of COUNT(CASE WHEN a > 10 then null else 1 end). Is that correct/fair? If that's the case why the need for a customization like this?

Another random thought is that one could define an extension function CONDITIONAL_COUNT(1, a > 10) or similar to provides this functionality. I'm just trying to figure out the specifics to why this should be expressed separately. Also, are you looking to add this to a particular physical aggregation (e.g. hash aggregation) or to the logical aggregation operation?

ignoreNullKeys boolean flag
This also sounds more like something you'd want to apply to a specific physical aggregation (e.g. hash aggregation), is that correct. Is your request across all grouping sets, per grouping set or per field per grouping set? For example, imagine if I have the grouping sets: [[city, state, country],[state]], what variations do you want to express (for example, if I get a null for state, I would figure I would exclude the record for the second grouping set. Would I exclude for the first when there are other grouping fields that are still populated?

jacques-n added a commit to jacques-n/substrait that referenced this issue Oct 3, 2021
- Remove aggregate expressions type from generalized expressions. (only allow aggregate expressions as root expressions for aggregation)
- Update function mapping to support options
- Remove named structs from type unions (should only be used in special places as root, not in arbitrary hierarchy)
- Add project, join, fetch, aggregate, sort, set logical relational operations.
- Introduce key scalar and aggregate functions in functions yaml.
- Remove old extensions docs

Address substrait-io#42, substrait-io#43, substrait-io#44
@jacques-n
Copy link
Contributor

Also, would it make sense to add a "step" enum to aggregate operation: partial, final, intermediate or single?

@mbasmanova , in my most recent PR I've also proposed the following AggregationPhases:

enum AggregationPhase {

@mbasmanova
Copy link
Author

@jacques-n Thank you. I'll check these out.

jacques-n added a commit that referenced this issue Oct 5, 2021
Updates to ideally support majority of tpch queries

- Remove aggregate expressions type from generalized expressions. (only allow aggregate expressions as root expressions for aggregation)
- Update function mapping to support options
- Remove named structs from type unions (should only be used in special places as root, not in arbitrary hierarchy)
- Add project, join, fetch, aggregate, sort, set logical relational operations.
- Introduce key scalar and aggregate functions in functions yaml.
- Remove old extensions docs
- Add nullability handling and type parsing syntax.

Address #42, #43, #44
@jacques-n
Copy link
Contributor

Hey there, we've added support in aggregate measures for a masking filter per measure as part of #88. I believe that covers the majority of this ticket. For now, we propose using either a separate pre-filter for ignoreNulls or adding an AdvancedExtension that exposes that property in AggregateRel.

Closing this issue. Feel free to reopen!

rkondakov pushed a commit to rkondakov/substrait that referenced this issue Nov 21, 2023
…it-io#42)

* Ignore trailing semicolon during sql parsing

Co-authored-by: James Taylor <james@qack.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants