Skip to content

Incorrect answers due to an incorrectly removed Sort before AggregateExec #20244

@alamb

Description

@alamb

Describe the bug

While debugging the DataFusion 52 upgrade, I found a wrong results bug with pre-sorted data that was introduced in 52

To Reproduce

CREATE TABLE agg_src(x INT, y INT, v INT) AS VALUES
(1, 1, 10),
(1, 2, 20),
(1, 3, 30),
(2, 1, 40),
(2, 2, 50),
(2, 3, 60);

-- Create an ordered table:
COPY (SELECT * FROM agg_src ORDER BY x, y) TO 'foo.parquet';

Then run

CREATE EXTERNAL TABLE agg_src_sorted(x INT, y INT, v INT) STORED AS PARQUET LOCATION 'foo.parquet' WITH ORDER (x ASC, y ASC);

set datafusion.execution.target_partitions = 1;

-- This query orders by an expresson of y that breaks the ordering 
SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;

With Datafusion 52, you get the wrong answer:

andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-52.1.0
> SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 1                           | 40                    |
| 1 | 0                           | 20                    | <---- the second column is 1 then 0, rather than 0 then 1
| 2 | 1                           | 100                   |
| 2 | 0                           | 50                    |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.006 seconds.

On datafusion 51

andrewlamb@Andrews-MacBook-Pro-3 ~ % ~/Software/datafusion-cli/datafusion-cli-51.0.0

You get the expected answer

> SELECT
  x,
  CAST(y AS BIGINT) % 2,
  SUM(v)
FROM agg_src_sorted
GROUP BY x, CAST(y AS BIGINT) % 2
ORDER BY x, CAST(y AS BIGINT) % 2;
+---+-----------------------------+-----------------------+
| x | agg_src_sorted.y % Int64(2) | sum(agg_src_sorted.v) |
+---+-----------------------------+-----------------------+
| 1 | 0                           | 20                    | <---- this row is in the correct sopt
| 1 | 1                           | 40                    |
| 2 | 0                           | 50                    |
| 2 | 1                           | 100                   |
+---+-----------------------------+-----------------------+
4 row(s) fetched.
Elapsed 0.002 seconds.

### Expected behavior

_No response_

### Additional context

_No response_

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions