Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ordering sensitive aggregation #6332

Merged
merged 30 commits into from
May 15, 2023
Merged

Add support for ordering sensitive aggregation #6332

merged 30 commits into from
May 15, 2023

Conversation

mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented May 11, 2023

Which issue does this PR close?

Closes #6189.

Rationale for this change

Currently datafusion has support for queries as below

SELECT (ARRAY_AGG(s.amount)) AS amounts
            FROM sales_global AS s
            GROUP BY s.ts

However, when aggregation queries expect their input to have an ordering, queries fail. As an example see the query below.

SELECT (ARRAY_AGG(s.amount ORDER BY s.sn DESC)) AS amounts
            FROM sales_global AS s
            GROUP BY s.ts

Above query can be parsed successfully, however it cannot be executed. With this PR we add support for queries in the form above, where aggregate function requires an ordering for its input

What changes are included in this PR?

In this PR

  • We add support for ordering sensitive aggregate functions.
  • We refactored create_physical_sort_expr function.

Are these changes tested?

Yes new tests are added to the aggregate.slt file.

Are there any user-facing changes?

mustafasrepo and others added 22 commits May 3, 2023 14:46
# Conflicts:
#	datafusion/core/tests/sqllogictests/test_files/aggregate.slt
# Conflicts:
#	datafusion/core/src/physical_plan/planner.rs
#	datafusion/core/tests/sqllogictests/test_files/aggregate.slt
#	datafusion/expr/src/expr.rs
#	datafusion/expr/src/tree_node/expr.rs
#	datafusion/expr/src/udaf.rs
#	datafusion/optimizer/src/analyzer/type_coercion.rs
#	datafusion/optimizer/src/common_subexpr_eliminate.rs
#	datafusion/proto/src/logical_plan/from_proto.rs
#	datafusion/proto/src/logical_plan/mod.rs
#	datafusion/proto/src/logical_plan/to_proto.rs
#	datafusion/sql/src/expr/function.rs
#	datafusion/sql/src/utils.rs
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate labels May 11, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo

I will try and review this PR over the next day or two.

I think we should document this feature as well in https://arrow.apache.org/datafusion/user-guide/sql/select.html#group-by-clause if possible

I also noticed that window functions are not actually documented yet, so I filed #6338 to track that

@mustafasrepo
Copy link
Contributor Author

Thank you @mustafasrepo

I will try and review this PR over the next day or two.

I think we should document this feature as well in https://arrow.apache.org/datafusion/user-guide/sql/select.html#group-by-clause if possible

I also noticed that window functions are not actually documented yet, so I filed #6338 to track that

Thanks @alamb, I wasn't aware of the documentation part. I have updated the documentation at https://arrow.apache.org/datafusion/user-guide/sql/select.html#group-by-clause and https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html#array-agg

@alamb alamb added the api change Changes the API exposed to users of the crate label May 12, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mustafasrepo -- this looks really nice

I especially like how ordering is expressed as an updated "required_input_ordering" and then the rest of the datafusion planning machinery handles generating and optimizing the actual sorting. Very nice 👏

TUR [75.0, 100.0]

# test_ordering_sensitive_aggregation2
# We should be able to satisfy the finest requirement among all aggregators, when we have multiple aggregators.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test verifies that an aggregate with no sort and a single expr works

I think it would be valuable to also test more than one column, something like:

ARRAY_AGG(s.amount ORDER BY s.amount DESC, ts ASC) AS amounts,
SUM(s.amount ORDER BY s.amount DESC) AS sum1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we cannot support these kind of queries (because we cannot parse multiple requirement, PR in the sqlparser fixes the problem. However, it is not merged yet, I am not sure when it will be available to datafusion.). However, once this support is added. This case should be fine. I have added a test for this use case, we currently generate an error, once this is added. We can change the test.

# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
@alamb alamb merged commit d8a92be into apache:main May 15, 2023
@alamb
Copy link
Contributor

alamb commented May 15, 2023

Thanks again @mustafasrepo

@mustafasrepo mustafasrepo deleted the feature/ordering_sensitive_aggregation branch July 21, 2023 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for ordered aggregation
3 participants