Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-aggregation expressions should be included in the grouping keys #11903

Closed
goldmedal opened this issue Aug 9, 2024 · 2 comments
Closed

Non-aggregation expressions should be included in the grouping keys #11903

goldmedal opened this issue Aug 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@goldmedal
Copy link
Contributor

goldmedal commented Aug 9, 2024

Describe the bug

I noticed that the behavior of some tests is unusual.

SELECT r.sn, r.amount, SUM(r.amount)
FROM (SELECT *
FROM sales_global_with_pk as l
LEFT JOIN sales_global_with_pk as r
ON l.amount >= r.amount + 10)
GROUP BY r.sn
ORDER BY r.sn

Typically, non-aggregation expressions should appear in the GROUP BY clause. Other databases do not allow this behavior.

DuckDB

D CREATE TABLE sales_global_with_pk (zip_code INT,
            country VARCHAR(3),
            sn INT,
            ts TIMESTAMP,
            currency VARCHAR(3),
            amount FLOAT,
            primary key(sn)
          );
D SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10) r
    GROUP BY r.sn
  ORDER BY r.sn;
Binder Error: column "amount" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(amount)" if the exact value of "amount" is not important.
LINE 1: SELECT r.sn, r.amount, SUM(r.amount)
                     ^

Postgres

test= SELECT x.sn, x.amount, SUM(x.amount)                
  FROM (SELECT l.sn, l.amount
    FROM sales_global_with_pk as l
    LEFT JOIN sales_global_with_pk as r
    ON l.amount >= r.amount + 10) x
  GROUP BY x.sn  
ORDER BY x.sn;
ERROR:  column "x.amount" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT x.sn, x.amount, SUM(x.amount)

I modified the SQL to fix the scoping issue. However, both DuckDB and Postgres require that the dimensions (sn and amount) appear in the GROUP BY clause. In my experience, most databases follow similar behavior.

To Reproduce

As the above.

Expected behavior

This case should fail:

SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10)
    GROUP BY r.sn
  ORDER BY r.sn;

We should provide all non-aggregation expressions in the group-by clause

SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10)
    GROUP BY r.sn, r.amount
  ORDER BY r.sn;

Additional context

I'm working on #11681 now. I guess it can also fix this issue partially. However, another case as below won't be fixed.

SELECT r.sn, r.amount, SUM(r.amount)
  FROM (SELECT r.sn, r.amount
    FROM sales_global_with_pk as l
    LEFT JOIN sales_global_with_pk as r
    ON l.amount >= r.amount + 10)
  GROUP BY sn
ORDER BY r.sn
@goldmedal goldmedal added the bug Something isn't working label Aug 9, 2024
@goldmedal
Copy link
Contributor Author

goldmedal commented Aug 9, 2024

I did some tests. If I don't provide any group-by key, the SQL will fail.

> select sn, amount, sum(amount) from sales_global_with_pk;
Error during planning: Projection references non-aggregate values: Expression sales_global_with_pk.sn could not be resolved from available columns: sum(sales_global_with_pk.amount)

However, if I provide at least 1 group-by key, DataFusion will add the others to the group-by key implicitly.

> select sn, amount, sum(amount) from sales_global_with_pk group by sn;
+----+--------+----------------------------------+
| sn | amount | sum(sales_global_with_pk.amount) |
+----+--------+----------------------------------+
| 1  | 50.0   | 50.0                             |
| 4  | 100.0  | 100.0                            |
| 0  | 30.0   | 30.0                             |
| 3  | 200.0  | 200.0                            |
| 2  | 75.0   | 75.0                             |
+----+--------+----------------------------------+
5 row(s) fetched. 
Elapsed 0.005 seconds.

> explain select sn, amount, sum(amount) from sales_global_with_pk group by sn;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                    |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[sales_global_with_pk.sn, sales_global_with_pk.amount]], aggr=[[sum(CAST(sales_global_with_pk.amount AS Float64))]] |
|               |   TableScan: sales_global_with_pk projection=[sn, amount]                                                                               |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[sn@0 as sn, amount@1 as amount], aggr=[sum(sales_global_with_pk.amount)]                     |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                                           |
|               |     RepartitionExec: partitioning=Hash([sn@0, amount@1], 8), input_partitions=8                                                         |
|               |       RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                              |
|               |         AggregateExec: mode=Partial, gby=[sn@0 as sn, amount@1 as amount], aggr=[sum(sales_global_with_pk.amount)]                      |
|               |           MemoryExec: partitions=1, partition_sizes=[1]                                                                                 |
|               |                                                                                                                                         |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.002 seconds.

@goldmedal
Copy link
Contributor Author

hmm.. Ok, I found #6190 mentioned it's an optimization for the primary key and unique key. I think it may not be an issue. I'll close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant