Regression: Ordering by joined column doesn't return results #8374

DDtKey · 2023-11-30T12:25:28Z

Describe the bug

After updating to DataFusion 33, I noticed incorrect behavior in one of our internal tests that sorts by multiple columns.
It used to work in DataFusion 31.

To Reproduce

MRE with datafusion-cli:

CREATE TABLE users AS VALUES('Alice',50),('Bob',100);
CREATE TABLE employees AS VALUES('Alice','Finance'),('Bob','Marketing');

SELECT u.* FROM users u JOIN employees e ON u."column1" = e."column1" ORDER BY u."column1", e."column2";
0 rows in set. Query took 0.002 seconds.

However, without ordering by the joined column, it works:

SELECT u.* FROM users u JOIN employees e ON u."column1" = e."column1" ORDER BY u."column1";
+---------+---------+
| column1 | column2 |
+---------+---------+
| Alice   | 50      |
| Bob     | 100     |
+---------+---------+
2 rows in set. Query took 0.002 seconds.

Expected behavior

The query should return results, as it did in DataFusion 31.

Additional context

No response


suxiaogang223 · 2023-11-30T13:11:57Z

Hi, could you share more information, such as build features and platform 👀? I ran the same SQL and got the correct result on an M1 Mac, in both debug and release mode.

suxiaogang223 · 2023-11-30T13:59:54Z

Sorry, I was wrong: the bug can be triggered on branch-33. I had tested against the wrong code, at tag 33.0.0-rc1.

suxiaogang223 · 2023-11-30T14:33:52Z

I ran EXPLAIN on the query:

explain SELECT u.* FROM users u JOIN employees e ON u."column1" = e."column1" ORDER BY u."column1", e."column2";

On branch-33, the result is:

+---------------+--------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                 |
+---------------+--------------------------------------------------------------------------------------+
| logical_plan  | Projection: u.column1, u.column2                                                     |
|               |   Sort: u.column1 ASC NULLS LAST, e.column2 ASC NULLS LAST                           |
|               |     Projection: u.column1, u.column2, e.column2                                      |
|               |       Inner Join: u.column1 = e.column1                                              |
|               |         SubqueryAlias: u                                                             |
|               |           TableScan: users projection=[column1, column2]                             |
|               |         SubqueryAlias: e                                                             |
|               |           TableScan: employees projection=[column1, column2]                         |
| physical_plan | SortPreservingMergeExec: [column1@0 ASC NULLS LAST,column2@1 ASC NULLS LAST]         |
|               |   SortExec: expr=[column1@0 ASC NULLS LAST,column2@1 ASC NULLS LAST]                 |
|               |     ProjectionExec: expr=[column1@0 as column1, column2@1 as column2]                |
|               |       CoalesceBatchesExec: target_batch_size=8192                                    |
|               |         HashJoinExec: mode=Partitioned, join_type=Inner, on=[(column1@0, column1@0)] |
|               |           CoalesceBatchesExec: target_batch_size=8192                                |
|               |             RepartitionExec: partitioning=Hash([column1@0], 8), input_partitions=1   |
|               |               MemoryExec: partitions=1, partition_sizes=[1]                          |
|               |           CoalesceBatchesExec: target_batch_size=8192                                |
|               |             RepartitionExec: partitioning=Hash([column1@0], 8), input_partitions=1   |
|               |               MemoryExec: partitions=1, partition_sizes=[1]                          |
|               |                                                                                      |
+---------------+--------------------------------------------------------------------------------------+
2 rows in set. Query took 0.022 seconds.

On branch-31, the result is:

+---------------+-----------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                          |
+---------------+-----------------------------------------------------------------------------------------------+
| logical_plan  | Projection: u.column1, u.column2                                                              |
|               |   Sort: u.column1 ASC NULLS LAST, e.column2 ASC NULLS LAST                                    |
|               |     Projection: u.column1, u.column2, e.column2                                               |
|               |       Inner Join: u.column1 = e.column1                                                       |
|               |         SubqueryAlias: u                                                                      |
|               |           TableScan: users projection=[column1, column2]                                      |
|               |         SubqueryAlias: e                                                                      |
|               |           TableScan: employees projection=[column1, column2]                                  |
| physical_plan | ProjectionExec: expr=[column1@0 as column1, column2@1 as column2]                             |
|               |   SortPreservingMergeExec: [column1@0 ASC NULLS LAST,column2@2 ASC NULLS LAST]                |
|               |     SortExec: expr=[column1@0 ASC NULLS LAST,column2@2 ASC NULLS LAST]                        |
|               |       ProjectionExec: expr=[column1@0 as column1, column2@1 as column2, column2@3 as column2] |
|               |         CoalesceBatchesExec: target_batch_size=8192                                           |
|               |           HashJoinExec: mode=Partitioned, join_type=Inner, on=[(column1@0, column1@0)]        |
|               |             CoalesceBatchesExec: target_batch_size=8192                                       |
|               |               RepartitionExec: partitioning=Hash([column1@0], 8), input_partitions=8          |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1          |
|               |                   MemoryExec: partitions=1, partition_sizes=[1]                               |
|               |             CoalesceBatchesExec: target_batch_size=8192                                       |
|               |               RepartitionExec: partitioning=Hash([column1@0], 8), input_partitions=8          |
|               |                 RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1          |
|               |                   MemoryExec: partitions=1, partition_sizes=[1]                               |
|               |                                                                                               |
+---------------+-----------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.019 seconds.

The difference is in ProjectionExec: on branch-33, the projection wrongly excludes e.column2, so SortExec can't sort by e.column2.

haohuaijin · 2023-11-30T14:55:43Z

After some research, I found that this error is caused by the ProjectionPushdown rule in the physical optimizer:

| physical_plan after OutputRequirements  | ProjectionExec: expr=[column1@0 as column1, column2@1 as column2]                                                                                                                     |
|                                         |   SortPreservingMergeExec: [column1@0 ASC NULLS LAST,column2@2 ASC NULLS LAST]                                                                                                        |
|                                         |     SortExec: expr=[column1@0 ASC NULLS LAST,column2@2 ASC NULLS LAST]                                                                                                                |
|                                         |       ProjectionExec: expr=[column1@0 as column1, column2@1 as column2, column2@3 as column2]  <-- before: column2@3 (e.column2) is projected |
|                                         |         CoalesceBatchesExec: target_batch_size=8192                                                                                                                                   |
|                                         |           HashJoinExec: mode=Partitioned, join_type=Inner, on=[(column1@0, column1@0)]                                                                                                |
|                                         |             CoalesceBatchesExec: target_batch_size=8192                                                                                                                               |
|                                         |               RepartitionExec: partitioning=Hash([column1@0], 24), input_partitions=1                                                                                                 |
|                                         |                 MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                         |
|                                         |             CoalesceBatchesExec: target_batch_size=8192                                                                                                                               |
|                                         |               RepartitionExec: partitioning=Hash([column1@0], 24), input_partitions=1                                                                                                 |
|                                         |                 MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                         |
|                                         |                                                                                                                                                                                       |
| physical_plan after PipelineChecker     | SAME TEXT AS ABOVE                                                                                                                                                                    |
| physical_plan after LimitAggregation    | SAME TEXT AS ABOVE                                                                                                                                                                    |
| physical_plan after ProjectionPushdown  | SortPreservingMergeExec: [column1@0 ASC NULLS LAST,column2@1 ASC NULLS LAST]                                                                                                          |
|                                         |   SortExec: expr=[column1@0 ASC NULLS LAST,column2@1 ASC NULLS LAST]                                                                                                                  |
|                                         |     ProjectionExec: expr=[column1@0 as column1, column2@1 as column2]   <-- after: column2@3 (e.column2) has been eliminated |
|                                         |       CoalesceBatchesExec: target_batch_size=8192                                                                                                                                     |
|                                         |         HashJoinExec: mode=Partitioned, join_type=Inner, on=[(column1@0, column1@0)]                                                                                                  |
|                                         |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                                 |
|                                         |             RepartitionExec: partitioning=Hash([column1@0], 24), input_partitions=1                                                                                                   |
|                                         |               MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                           |
|                                         |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                                 |
|                                         |             RepartitionExec: partitioning=Hash([column1@0], 24), input_partitions=1                                                                                                   |
|                                         |               MemoryExec: partitions=1, partition_sizes=[1]

The reason for this rewrite may be that we use only the column name to identify a column in the code below:

https://github.com/apache/arrow-datafusion/blob/06bbe1298fa8aa042b6a6462e55b2890969d884a/datafusion/core/src/physical_optimizer/projection_pushdown.rs#L866-L872

When the column names are identical, the error arises.
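
To make the suspected failure mode concrete, here is a minimal sketch in plain Rust. `Col` and `find_by_name` are hypothetical stand-ins, not DataFusion's `Column` type or the code linked above; the column names and positions are taken from the plans in this comment.

```rust
// Simplified stand-in for a physical column reference (name + position in
// the input schema). This is not DataFusion's actual `Column` type.
#[derive(Debug, Clone, PartialEq)]
struct Col {
    name: &'static str,
    index: usize,
}

// Hypothetical helper: look `target` up among the projected columns by name
// only, returning its position in the projection output.
fn find_by_name(projected: &[Col], target: &Col) -> Option<usize> {
    projected.iter().position(|c| c.name == target.name)
}

fn main() {
    // Outer projection from the plan: column1@0, column2@1 (both from `u`).
    let projected = [
        Col { name: "column1", index: 0 },
        Col { name: "column2", index: 1 },
    ];
    // Second sort key from the plan: column2@2, which is e.column2.
    let sort_key = Col { name: "column2", index: 2 };

    // The name matches u.column2, so the key is "found" at position 1 even
    // though the projection does not contain e.column2 at all.
    let rewritten = find_by_name(&projected, &sort_key);
    println!("{}@{} -> {:?}", sort_key.name, sort_key.index, rewritten);
}
```

In this model the sort key is rewritten to position 1, which corresponds to the `column2@1` seen in the SortExec after ProjectionPushdown.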

DDtKey · 2023-11-30T15:00:23Z

When the column names are identical, the error arises.

Just to clarify: in my tests this also failed with different column names. The MRE just happens to use auto-generated column names.

haohuaijin · 2023-11-30T15:37:34Z

Just to clarify: in my tests this also failed with different column names. The MRE just happens to use auto-generated column names.

@DDtKey could you provide some examples? When the column names are different, it works for me in DataFusion 33:

DataFusion CLI v33.0.0
❯ create table u(a text, b int) as values ('Alice', 50), ('Bob', 100);
0 rows in set. Query took 0.023 seconds.

❯ create table e(c text, d text) as values ('Alice', 'Finance'), ('Bob', 'Marketing');
0 rows in set. Query took 0.000 seconds.

❯ select u.* from u join e on u.a = e.c order by u.a, e.d;
+-------+-----+
| a     | b   |
+-------+-----+
| Alice | 50  |
| Bob   | 100 |
+-------+-----+
2 rows in set. Query took 0.021 seconds.

DDtKey · 2023-11-30T15:44:56Z

Sorry for the confusion, you're right.
It works when the columns used in ORDER BY have different names, so that seems to be the root of the problem.

Asura7969 · 2023-12-06T08:20:36Z

My initial solution:

.find_map(|(index, (projected_expr, alias))| {
  projected_expr.as_any().downcast_ref::<Column>().and_then(
      |projected_column| {
          // Added: also compare the column index, not only the name.
          (column.index() == projected_column.index()
              && column.name().eq(projected_column.name()))
          .then(|| {
              state = RewriteState::RewrittenValid;
              Arc::new(Column::new(alias, index)) as _
          })
      },
  )
})

With the index comparison added:

DataFusion CLI v33.0.0
❯ CREATE TABLE users AS VALUES('Alice',50),('Bob',100);
0 rows in set. Query took 0.022 seconds.

❯ CREATE TABLE employees AS VALUES('Alice','Finance'),('Bob','Marketing');
0 rows in set. Query took 0.008 seconds.

❯ SELECT u.* FROM users u JOIN employees e ON u."column1" = e."column1" ORDER BY u."column1", e."column2";
+---------+---------+
| column1 | column2 |
+---------+---------+
| Alice   | 50      |
| Bob     | 100     |
+---------+---------+

The result is correct
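
For readers who want to try the idea outside DataFusion, here is a self-contained sketch of the same lookup shape. It assumes a simplified `Col` struct and a hypothetical `rewrite` function; it is not the code from `projection_pushdown.rs`, and it leaves out the `RewriteState` bookkeeping.

```rust
// Simplified stand-in for a physical column reference; not DataFusion's type.
#[derive(Debug, Clone, PartialEq)]
struct Col {
    name: String,
    index: usize,
}

impl Col {
    fn new(name: &str, index: usize) -> Self {
        Col { name: name.to_string(), index }
    }
}

// Hypothetical model of the lookup: each projection entry is
// (expression, alias). Returns the column rewritten against the projection
// output, or None when the projection does not contain it.
fn rewrite(column: &Col, projection: &[(Col, &str)]) -> Option<Col> {
    projection
        .iter()
        .enumerate()
        .find_map(|(index, (projected_column, alias))| {
            // Both the input index and the name must match.
            (column.index == projected_column.index
                && column.name == projected_column.name)
                .then(|| Col::new(alias, index))
        })
}

fn main() {
    // Outer projection from the MRE plan: column1@0, column2@1.
    let projection = [
        (Col::new("column1", 0), "column1"),
        (Col::new("column2", 1), "column2"),
    ];
    // u.column1@0 is part of the projection, so it can be rewritten.
    println!("{:?}", rewrite(&Col::new("column1", 0), &projection));
    // e.column2@2 is not, so the rewrite fails instead of picking u.column2.
    println!("{:?}", rewrite(&Col::new("column2", 2), &projection));
}
```

In this model a failed rewrite is the desired outcome for the second sort key: the projection is then left above the sort rather than pushed below it.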

Asura7969 · 2023-12-06T08:49:28Z

Addendum:

Maybe we should add special handling for SortExec 🤔

haohuaijin · 2023-12-10T14:11:46Z

Regarding the initial solution above: using both the name and the index (the index being the column's position in the input schema) to identify a column only works under the assumption that the plan containing the column and the plan containing the projected column have the same input schema. Otherwise, some projections that could be pushed down would no longer be pushed down. And when the schemas are the same, the index alone is enough to identify a column.
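
The last point can be sketched in a few lines of Rust. This is only an illustration under the stated assumption that both column references are resolved against one shared input schema; `name_of` and `find_by_index` are hypothetical helpers, not DataFusion functions.

```rust
// Hypothetical helper: with a single shared input schema, the index alone
// determines the column name.
fn name_of<'a>(schema: &[&'a str], index: usize) -> Option<&'a str> {
    schema.get(index).copied()
}

// Hypothetical index-only lookup: position of the projected expression that
// reads input column `index`, if any.
fn find_by_index(projected_indices: &[usize], index: usize) -> Option<usize> {
    projected_indices.iter().position(|&i| i == index)
}

fn main() {
    // Input schema seen by the outer projection in the MRE plan:
    // u.column1, u.column2, e.column2.
    let schema = ["column1", "column2", "column2"];
    // The outer projection reads input columns 0 and 1.
    let projected = [0, 1];

    // Positions 1 and 2 share a name but are distinct columns.
    println!("{:?} {:?}", name_of(&schema, 1), name_of(&schema, 2));
    // u.column2 (index 1) is found; e.column2 (index 2) is not.
    println!("{:?}", find_by_index(&projected, 1));
    println!("{:?}", find_by_index(&projected, 2));
}
```

Because the name is a function of the index here, comparing names as well would add nothing in the same-schema case.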

DDtKey · 2023-12-27T18:37:18Z

Why isn't this issue treated as a regression, while new stable versions continue to be released?
This worked until version 31. We are now at 34, and the regression is being ignored 🤔

Note: I'm not talking about bugs in general, but about regressions. Unfortunately they occur quite often, and they are more dangerous because they undermine trust in new versions.

Thus we have the following situation:
Each version requires code adaptation when it contains incompatible changes, but it is not right to use a new version with an obvious regression. So the changes get made but cannot be applied (and their number keeps growing).

I may be a little behind on the current DataFusion release policy, but I think any regression should be prioritized so that new stable patch releases can go out. I understand that right now releases simply follow a schedule, but perhaps it is time for a stricter release cycle for stable versions? (A question for a separate issue, of course.)

cc @alamb

alamb · 2023-12-28T19:57:00Z

Note: I'm not talking about bugs in general, but about regressions. Unfortunately they occur quite often, and they are more dangerous because they undermine trust in new versions.

Thank you for bringing this up -- I agree we need to prioritize regressions -- I personally missed this particular bug as a regression and thought it was a pre-existing bug. I have updated the title to reflect this and created a new tag for regressions

cc @andygrove @viirya and @ozankabak

ozankabak · 2023-12-28T20:04:53Z

@DDtKey I think some people may have assumed #8485 fixed it (at least I did). You are right that such regressions should get priority and we will prioritize this.

DDtKey · 2023-12-28T20:25:24Z

@ozankabak thanks for pointing to the PR. It looks like I missed that it was merged before 34.0.0 was released (and the issue has not been closed yet).

So that was a wrong assumption on my part, sorry. (To be clear: my test still fails, but because of another issue, #7931, which is not related to this one. I'm going to check it separately; it also used to work in 31.)
I tested the MRE, and this case works with the latest stable version.
As has been mentioned, there may still be some underlying issues, but they are not related to this one.

alamb · 2023-12-28T20:53:59Z

BTW one of the longer term discussions I would like to have at #8152 and in other venues (I just haven't had time to write it down yet) is how to improve the overall "process maturity" of datafusion -- like @DDtKey points out that regressions should be prioritized, but at the moment we don't really have a mechanism to do that (or, for example, hold the release for such regressions) other than by relying on one of us to catch it manually

DDtKey · 2024-01-04T11:58:55Z

Should we close this issue as fixed in 34.0.0 to avoid confusion?

ozankabak · 2024-01-04T12:02:46Z

@DDtKey sounds good 👍

DDtKey added the "bug" label (Something isn't working) on Nov 30, 2023

Jefffrey mentioned this issue on Nov 30, 2023: Consider introducing unique expression IDs in Logical/Physical plan #8379 (Open)

alamb mentioned this issue on Dec 1, 2023: [Epic] A collection of Join Improvements #8398 (Open, 10 tasks)

haohuaijin mentioned this issue on Dec 10, 2023: fix: incorrect set preserve_partitioning in SortExec #8485 (Merged)

alamb changed the title from "Ordering by joined column doesn't return results" to "Regression: Ordering by joined column doesn't return results" on Dec 28, 2023

alamb added the "regression" label (Something that used to work no longer does) on Dec 28, 2023

ozankabak closed this as completed on Jan 4, 2024
