Regression: Ordering by joined column doesn't return results #8374
Comments
Hi, could you share more information, such as build features and platform 👀? I ran the same SQL and got the correct result on an M1 Mac, in both debug and release mode.
Sorry, the bug can be triggered on branch-33; I was using the wrong code, on tag 33.0.0-rc1.
I ran EXPLAIN on the SQL: explain SELECT u.* FROM users u JOIN employees e ON u."column1" = e."column1" ORDER BY u."column1", e."column2"; On branch-33, the result is:
On branch-31, the result is:
The difference is in ProjectionExec: on branch-33, the projection wrongly excludes e.column2, so SortExec cannot sort by e.column2.
After doing some research, I found that this error is caused by
The reason for this rewrite may be that we only use the column name to identify a column in the code below: When the column names are identical, the error arises.
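To illustrate the ambiguity described above, here is a minimal, self-contained sketch. It does not use DataFusion's types: `Col` and `find_by_name` are hypothetical stand-ins, and the join output is assumed to be `u.column1, u.column2, e.column1, e.column2` (input indices 0 to 3), with the projection keeping only the two `u` columns.

```rust
/// Hypothetical stand-in for a physical column reference: a name plus its
/// position in the input schema. This is NOT DataFusion's actual `Column`.
#[derive(Debug, Clone, PartialEq)]
struct Col {
    name: String,
    index: usize,
}

impl Col {
    fn new(name: &str, index: usize) -> Self {
        Col { name: name.to_string(), index }
    }
}

/// Looks `wanted` up in `projected` by name only and returns the position
/// of the first match within the projection, ignoring the input index.
fn find_by_name(projected: &[Col], wanted: &Col) -> Option<usize> {
    projected.iter().position(|c| c.name == wanted.name)
}
```

With `projected = [Col::new("column1", 0), Col::new("column2", 1)]` and a sort key `Col::new("column2", 3)` standing for `e.column2`, `find_by_name` returns `Some(1)`. That position holds `u.column2`, so a name-only check would conclude the projection already carries the sort column even though `e.column2` was dropped.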
Just to clarify: in my tests this failed with different column names as well. It's just that the MRE uses auto-generated column names.
@DDtKey could you provide some cases? When the column names are different, I find it works in
Sorry for the confusion, you're right.
My initial solution:
    .find_map(|(index, (projected_expr, alias))| {
        projected_expr.as_any().downcast_ref::<Column>().and_then(
            |projected_column| {
                // Compare the index in addition to the name.
                (column.index() == projected_column.index()
                    && column.name().eq(projected_column.name()))
                .then(|| {
                    state = RewriteState::RewrittenValid;
                    Arc::new(Column::new(alias, index)) as _
                })
            },
        )
    })
The result is correct.
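The idea behind the snippet above can be sketched in isolation as follows. This is a simplified model, not the DataFusion code: `Col` and `rewrite_through_projection` are hypothetical names, and each projection entry is assumed to be a pair of (input column, output alias).

```rust
/// Hypothetical stand-in for a column reference (name + input-schema index).
#[derive(Debug, Clone, PartialEq)]
struct Col {
    name: String,
    index: usize,
}

/// Rewrites `column` in terms of the projection's output, matching on BOTH
/// the input index and the name. Returns `None` when the projection does
/// not carry that exact column.
fn rewrite_through_projection(column: &Col, projection: &[(Col, String)]) -> Option<Col> {
    projection
        .iter()
        .enumerate()
        .find_map(|(out_index, (projected, alias))| {
            (column.index == projected.index && column.name == projected.name).then(|| Col {
                // The rewritten column refers to the projection's output slot.
                name: alias.clone(),
                index: out_index,
            })
        })
}
```

Under this model, a sort key with name `column2` and input index 3 is not matched by a projection entry with name `column2` and input index 1, so the lookup yields `None` instead of silently resolving to the wrong column.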
Using both the name and the index (the index is the column's index in the input schema) to identify a column relies on the assumption that the input schema of
Why don't we treat this issue as a regression, instead of continuing to release new stable versions? Note: I'm not talking about bugs in general, but about regressions. Unfortunately, they occur quite often and they are more dangerous, because they erode trust in new versions. Thus we have the following situation: I may be a little behind the current cc @alamb
Thank you for bringing this up -- I agree we need to prioritize regressions -- I personally missed this particular bug as a regression and thought it was a pre-existing bug. I have updated the title to reflect this and created a new tag for regressions cc @andygrove @viirya and @ozankabak
@ozankabak thanks for pointing to the PR. Looks like I missed that it was merged prior to releasing So that was my wrong assumption, sorry (to be clear, my test still fails, but due to another issue, #7931, which is not related to this one; I'm going to check further - it used to work in
BTW one of the longer term discussions I would like to have at #8152 and in other venues (I just haven't had time to write it down yet) is how to improve the overall "process maturity" of datafusion -- like @DDtKey points out that regressions should be prioritized, but at the moment we don't really have a mechanism to do that (or, for example, hold the release for such regressions) other than by relying on one of us to catch it manually
Should we close this issue as fixed in
@DDtKey sounds good 👍
Describe the bug
After updating to datafusion: 33, I've noticed wrong behavior in our internal test that sorts by multiple columns. It used to work in datafusion: 31.
To Reproduce
MRE with datafusion-cli:
At the same time, it works without ordering by the joined column:
Expected behavior
It should work as before
Additional context
No response