Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigation into aggregates, supported operators and multiple enumerable parameters #27948

Closed
roji opened this issue May 4, 2022 · 3 comments · Fixed by #27969
Closed

Investigation into aggregates, supported operators and multiple enumerable parameters #27948

roji opened this issue May 4, 2022 · 3 comments · Fixed by #27969
Assignees
Labels
area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Milestone

Comments

@roji
Copy link
Member

roji commented May 4, 2022

I did some further investigation into what's supported across databases in terms of aggregates, operators (distinct/ordering/filtering) and multiple enumerable parameters; this is to help inform @smitpatel's work on custom aggregate support (see #27931, #27935).

tl;dr:

  • Aggregate functions with multiple enumerable parameters are a very rare thing. The only example I'm seeing is creating a JSON document from two columns, which function as the keys and values (PG jsonb_object_agg, MySQL JSON_ARRAYAGG, SQLite json_group_array).
  • Operators are always applied at the input row level, and never at the individual argument level. In other words, I haven't found a case where one can specify different distinct/order by/filtering on different single enumerable arguments. The syntax for specifying the operators frequently conveys this, e.g. SQL Server ordering and PG filtering are outside the aggregate function's parentheses.
  • This means that although it is possible to express different operators in LINQ, AFAICT this won't be supported in SQL (providers will have to detect this and throw).
  • Not sure this has concrete implications for us. One possible action from this could be to for the relational query pipeline to extract operators from SqlEnumerableExpression parameters, ensure they're identical, and pass them to the aggregate method translator as separate parameters.
    • This could be considered slightly "cleaner", since providers see the operators at the level where they actually operate in SQL, i.e. at the function level, not at the argument level.
    • As in Query: Consider having predicate on SqlEnumerableExpression so providers can integrate it accordingly #27935, this would require translators to deal with the predicate explicitly, i.e. do a CASE/WHEN, copy it to their custom aggregate function expression or whatever.
    • Given how few multiarg functions there are, this probably isn't very critical (most functions have a single enumerable arg, so exactly where the operators are doesn't matter that much).
  • Writing an aggregate method translator will definitely be less trivial than a regular method translator :)
PostgreSQL
DROP TABLE IF EXISTS data;
CREATE TABLE data
(
  id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
  jkey TEXT,
  jval TEXT
);

INSERT INTO data (jkey, jval) VALUES ('foo1', 'bar1'), ('foo2', 'bar2'), ('foo3', 'bar3'), ('foo1', 'bar1');

-- Distinct, ordering and filtering in regular scenario (single enumerable parameter)
SELECT jsonb_agg(DISTINCT jkey) FROM data;
SELECT jsonb_agg(jkey ORDER BY jkey DESC) FROM data;
SELECT jsonb_agg(jkey) FILTER (WHERE jval NOT LIKE '%3') FROM data;

-- Multiple enumerable arguments
-- Docs for json_object_agg: https://www.postgresql.org/docs/current/functions-aggregate.html
SELECT json_object_agg(jkey, jval) FROM data;

-- Only one ORDER BY is supported, applying to the row, not to each enumerable argument:
SELECT json_object_agg(jkey, jval ORDER BY jkey DESC) FROM data;
-- ORDER BY on a specific argument is not supported:
SELECT json_object_agg(jkey ORDER BY jkey DESC, jval) FROM data;

-- Same with distinct: only one DISTINCT is allowed, applied to the row as a whole, at the start of the parentheses:
SELECT json_object_agg(DISTINCT jkey, jval) FROM data;
-- DISTINCT on a specific argument is not supported:
SELECT json_object_agg(jkey, DISTINCT jval) FROM data;

-- Same with filtering: the FILTER clause is specified outside the parentheses:
SELECT json_object_agg(jkey, jval) FILTER (WHERE jval NOT LIKE '%3') FROM data;

-- It may be possible to apply different filters/distincts on different arguments via a subquery pushdown (i.e.
-- apply the aggregate function to the output of a subquery)
MySQL
DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    id INT PRIMARY KEY AUTO_INCREMENT,
    jkey VARCHAR(30),
    jval VARCHAR(30)
);

INSERT INTO data (jkey, jval) VALUES ('foo1', 'bar1'), ('foo2', 'bar2'), ('foo3', 'bar3'), ('foo1', 'bar1');

-- Distinct and ordering in regular scenario (single enumerable parameter). Filtering apparently not supported.
SELECT GROUP_CONCAT(DISTINCT jkey) FROM data;
SELECT GROUP_CONCAT(jkey ORDER BY jkey DESC) FROM data;

-- Multiple enumerable arguments
-- Docs: https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_json-arrayagg
SELECT JSON_OBJECTAGG(jkey, jval) FROM data;

-- DISTINCT and ORDER BY not supported:
SELECT JSON_OBJECTAGG(DISTINCT jkey, jval) FROM data;
SELECT JSON_OBJECTAGG(jkey, jval ORDER BY jval) FROM data;
SQL Server
DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    id INT PRIMARY KEY IDENTITY,
    jkey VARCHAR(MAX),
    jval VARCHAR(MAX)
);

INSERT INTO data (jkey, jval) VALUES ('foo1', 'bar1'), ('foo2', 'bar2'), ('foo3', 'bar3'), ('foo1', 'bar1');

-- Docs: https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql
SELECT STRING_AGG(jkey, ',') FROM data;
-- With ordering:
SELECT STRING_AGG(jkey, ',') WITHIN GROUP (ORDER BY jkey DESC) FROM data;

-- Distinct supported in regular cases:
SELECT MAX(DISTINCT jkey) FROM data;
-- But not with STRING_AGG (because multiple parameters?):
SELECT STRING_AGG(DISTINCT jkey, ',') FROM data;

-- No JSON aggregate functions, no functions with multiple enumerable parameters
-- (docs: https://docs.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql)
SQLite
DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    jkey TEXT,
    jval TEXT
);

INSERT INTO data (jkey, jval) VALUES ('foo1', 'bar1'), ('foo2', 'bar2'), ('foo3', 'bar3'), ('foo1', 'bar1');

-- Docs: https://www.sqlite.org/json1.html#jgroupobject

-- Distinct works for single enumerable parameter, ordering/filtering apparently not supported
SELECT json_group_array(DISTINCT jkey) FROM data;

-- Distinct not supported for multiple enumerable parameters:
SELECT json_group_object(DISTINCT jkey, jval) FROM data; -- Error: DISTINCT aggregates must have exactly one argument
Oracle

Oracle also has JSON_ARRAY_AGG (single enumerable parameter) and JSON_OBJECTAGG (multiple). The former supports an order by clause, the latter does not.

/cc @lauxjpn

@smitpatel
Copy link
Contributor

Note: Add exception message for ordering in RelationalQuerySqlGenerator.

@ajcvickers ajcvickers added this to the 7.0.0 milestone May 6, 2022
@smitpatel smitpatel added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label May 6, 2022
smitpatel added a commit that referenced this issue May 6, 2022
This iterates over design in #27931

- Bring back DistinctExpression (yes the token in invalid in some places and incorrect tree will throw invalid SQL error in database
- Introduce EnumerableExpression which is a holder/parameter object which contains facets of grouping element chain including Distinct/Predicate/Selector/Ordering
- Translators are responsible for putting together pieces from EnumerableExpression to generate a SqlExpression as a result
- Remove GroupByAggregateChainProcessor which failed to avoid double visitation. We need to refactor this code in future to avoid it when we implement public API for aggregate functions

Resolves #27948
Resolves #27935
smitpatel added a commit that referenced this issue May 6, 2022
This iterates over design in #27931

- Bring back DistinctExpression (yes the token in invalid in some places and incorrect tree will throw invalid SQL error in database
- Introduce EnumerableExpression which is a holder/parameter object which contains facets of grouping element chain including Distinct/Predicate/Selector/Ordering
- Translators are responsible for putting together pieces from EnumerableExpression to generate a SqlExpression as a result
- Remove GroupByAggregateChainProcessor which failed to avoid double visitation. We need to refactor this code in future to avoid it when we implement public API for aggregate functions

Resolves #27948
Resolves #27935
smitpatel added a commit that referenced this issue May 10, 2022
This iterates over design in #27931

- Bring back DistinctExpression (yes the token in invalid in some places and incorrect tree will throw invalid SQL error in database
- Introduce EnumerableExpression which is a holder/parameter object which contains facets of grouping element chain including Distinct/Predicate/Selector/Ordering
- Translators are responsible for putting together pieces from EnumerableExpression to generate a SqlExpression as a result
- Remove GroupByAggregateChainProcessor which failed to avoid double visitation. We need to refactor this code in future to avoid it when we implement public API for aggregate functions

Resolves #27948
Resolves #27935
smitpatel added a commit that referenced this issue May 10, 2022
This iterates over design in #27931

- Bring back DistinctExpression (yes the token in invalid in some places and incorrect tree will throw invalid SQL error in database
- Introduce EnumerableExpression which is a holder/parameter object which contains facets of grouping element chain including Distinct/Predicate/Selector/Ordering
- Translators are responsible for putting together pieces from EnumerableExpression to generate a SqlExpression as a result
- Remove GroupByAggregateChainProcessor which failed to avoid double visitation. We need to refactor this code in future to avoid it when we implement public API for aggregate functions

Resolves #27948
Resolves #27935
@lauxjpn
Copy link
Contributor

lauxjpn commented May 13, 2022

@roji Do providers need to take steps to support this, or is this more of an EF Core internal spring cleaning?

@smitpatel
Copy link
Contributor

@lauxjpn - Currently, this has just created infrastructure to support aggregate operators. Existing operations - Min/Max/Sum/Average/Count/LongCount, are modified to use this system but still convert to the form we were passing to providers before so it should continue to work. Full support for custom operations is being tracked #22957 which will utilize above infra. So if a provider is doing something specific wrt those 6 default operations, they would need to react.

@ajcvickers ajcvickers modified the milestones: 7.0.0, 7.0.0-preview5 May 25, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0-preview5, 7.0.0 Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants