Non-aggregation expressions should be included in the grouping keys #11903

goldmedal · 2024-08-09T14:47:47Z

Describe the bug

I noticed that the behavior of some tests is unusual.

datafusion/datafusion/sqllogictest/test_files/group_by.slt

Lines 3481 to 3487 in b5d7931

    
           SELECT r.sn, r.amount, SUM(r.amount) 
        
             FROM (SELECT * 
        
               FROM sales_global_with_pk as l 
        
               LEFT JOIN sales_global_with_pk as r 
        
               ON l.amount >= r.amount + 10) 
        
             GROUP BY r.sn 
        
           ORDER BY r.sn

Typically, non-aggregation expressions should appear in the GROUP BY clause. Other databases do not allow this behavior.

DuckDB

D CREATE TABLE sales_global_with_pk (zip_code INT,
            country VARCHAR(3),
            sn INT,
            ts TIMESTAMP,
            currency VARCHAR(3),
            amount FLOAT,
            primary key(sn)
          );
D SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10) r
    GROUP BY r.sn
  ORDER BY r.sn;
Binder Error: column "amount" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(amount)" if the exact value of "amount" is not important.
LINE 1: SELECT r.sn, r.amount, SUM(r.amount)
                     ^

Postgres

test= SELECT x.sn, x.amount, SUM(x.amount)                
  FROM (SELECT l.sn, l.amount
    FROM sales_global_with_pk as l
    LEFT JOIN sales_global_with_pk as r
    ON l.amount >= r.amount + 10) x
  GROUP BY x.sn  
ORDER BY x.sn;
ERROR:  column "x.amount" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT x.sn, x.amount, SUM(x.amount)

I modified the SQL to fix the scoping issue. However, both DuckDB and Postgres require that the dimensions (sn and amount) appear in the GROUP BY clause. In my experience, most databases follow similar behavior.

To Reproduce

As the above.

Expected behavior

This case should fail:

SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10)
    GROUP BY r.sn
  ORDER BY r.sn;

We should provide all non-aggregation expressions in the group-by clause

SELECT r.sn, r.amount, SUM(r.amount)
    FROM (SELECT *
      FROM sales_global_with_pk as l
      LEFT JOIN sales_global_with_pk as r
      ON l.amount >= r.amount + 10)
    GROUP BY r.sn, r.amount
  ORDER BY r.sn;

Additional context

I'm working on #11681 now. I guess it can also fix this issue partially. However, another case as below won't be fixed.

SELECT r.sn, r.amount, SUM(r.amount)
  FROM (SELECT r.sn, r.amount
    FROM sales_global_with_pk as l
    LEFT JOIN sales_global_with_pk as r
    ON l.amount >= r.amount + 10)
  GROUP BY sn
ORDER BY r.sn

The text was updated successfully, but these errors were encountered:

goldmedal · 2024-08-09T15:10:57Z

I did some tests. If I don't provide any group-by key, the SQL will fail.

> select sn, amount, sum(amount) from sales_global_with_pk;
Error during planning: Projection references non-aggregate values: Expression sales_global_with_pk.sn could not be resolved from available columns: sum(sales_global_with_pk.amount)

However, if I provide at least 1 group-by key, DataFusion will add the others to the group-by key implicitly.

> select sn, amount, sum(amount) from sales_global_with_pk group by sn;
+----+--------+----------------------------------+
| sn | amount | sum(sales_global_with_pk.amount) |
+----+--------+----------------------------------+
| 1  | 50.0   | 50.0                             |
| 4  | 100.0  | 100.0                            |
| 0  | 30.0   | 30.0                             |
| 3  | 200.0  | 200.0                            |
| 2  | 75.0   | 75.0                             |
+----+--------+----------------------------------+
5 row(s) fetched. 
Elapsed 0.005 seconds.

> explain select sn, amount, sum(amount) from sales_global_with_pk group by sn;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                    |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[sales_global_with_pk.sn, sales_global_with_pk.amount]], aggr=[[sum(CAST(sales_global_with_pk.amount AS Float64))]] |
|               |   TableScan: sales_global_with_pk projection=[sn, amount]                                                                               |
| physical_plan | AggregateExec: mode=FinalPartitioned, gby=[sn@0 as sn, amount@1 as amount], aggr=[sum(sales_global_with_pk.amount)]                     |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                                           |
|               |     RepartitionExec: partitioning=Hash([sn@0, amount@1], 8), input_partitions=8                                                         |
|               |       RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1                                                              |
|               |         AggregateExec: mode=Partial, gby=[sn@0 as sn, amount@1 as amount], aggr=[sum(sales_global_with_pk.amount)]                      |
|               |           MemoryExec: partitions=1, partition_sizes=[1]                                                                                 |
|               |                                                                                                                                         |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.002 seconds.

goldmedal · 2024-08-09T15:40:21Z

hmm.. Ok, I found #6190 mentioned it's an optimization for the primary key and unique key. I think it may not be an issue. I'll close this.

goldmedal added the bug Something isn't working label Aug 9, 2024

goldmedal closed this as completed Aug 9, 2024

goldmedal mentioned this issue Aug 12, 2024

Move wildcard expansions to the analyzer #11681

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-aggregation expressions should be included in the grouping keys #11903

Non-aggregation expressions should be included in the grouping keys #11903

goldmedal commented Aug 9, 2024 •

edited

Loading

goldmedal commented Aug 9, 2024 •

edited

Loading

goldmedal commented Aug 9, 2024

Non-aggregation expressions should be included in the grouping keys #11903

Non-aggregation expressions should be included in the grouping keys #11903

Comments

goldmedal commented Aug 9, 2024 • edited Loading

Describe the bug

DuckDB

Postgres

To Reproduce

Expected behavior

Additional context

goldmedal commented Aug 9, 2024 • edited Loading

goldmedal commented Aug 9, 2024

goldmedal commented Aug 9, 2024 •

edited

Loading

goldmedal commented Aug 9, 2024 •

edited

Loading