fix: ignore non-existent columns when adding filter equivalence info in FileScanConfig
#17546
…in FileScanConfig
The fact that |
Re-ran the TPCH benchmark with the same configuration as the referenced issue and all the tests pass now. Will add a regression test here in a bit!
9dffcd1 to f6e8894
My understanding of the issue, which I would like to clarify and document in the PR (maybe in the code) if it is correct, is that this applies to a situation like:
SELECT a
FROM t
WHERE b = 2
So it makes sense that we have a filter that references the column `b`, which is not projected. A filter referencing a non-projected column is not an issue in itself, and we can compute equivalence properties for a non-projected column. However, things downstream (not touched in this PR) will fail or error if they receive equivalence properties for a column that is not in the projection. Equivalence properties exist to communicate information to the parent operator, and as far as the parent operator is concerned there is no column `b`.
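To make the scenario concrete, here is a minimal std-only sketch of a scan that outputs only `a` while a pushed-down filter pins `b = 2`. It is not DataFusion's code: `Schema`, `index_of` and `visible_constants` are hypothetical stand-ins written for this illustration. The only point is that facts about columns outside the output schema are dropped before they reach the parent operator.

```rust
/// Hypothetical stand-in for an output schema: an ordered list of column names.
struct Schema {
    fields: Vec<String>,
}

impl Schema {
    fn new(names: &[&str]) -> Self {
        Schema {
            fields: names.iter().map(|n| n.to_string()).collect(),
        }
    }

    /// Position of `name` in the schema, or an error if it is not projected.
    fn index_of(&self, name: &str) -> Result<usize, String> {
        self.fields
            .iter()
            .position(|f| f == name)
            .ok_or_else(|| format!("column {name} not found in schema"))
    }
}

/// Keeps only the `column = literal` facts whose column the parent can see.
fn visible_constants<'a>(
    output_schema: &Schema,
    filter_constants: &[(&'a str, i64)],
) -> Vec<(&'a str, i64)> {
    filter_constants
        .iter()
        .copied()
        .filter(|(name, _)| output_schema.index_of(name).is_ok())
        .collect()
}
```

With an output schema of just `a`, the fact `b = 2` is filtered out, so the parent never hears about a column it cannot resolve.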
macro_rules! ignore_dangling_col {
    ($col:expr) => {
        if let Some(col) = $col.as_any().downcast_ref::<Column>() {
            if schema.index_of(col.name()).is_err() {
                continue;
            }
        }
    };
}
Does this need to be recursive? I think the answer is no because `collect_columns_from_predicate` is guaranteed to return a `Vec<Column>` even if the returned type is `Vec<Arc<dyn PhysicalExpr>>`. Is that right?
I think by this point the expressions returned by `collect_columns_from_predicate` should be column/literal equi-pairs, since those are what will be successfully pushed down. So yeah, I don't think there's a need to recursively inspect the expressions.
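The non-recursive check discussed above can be sketched with plain `std::any::Any`. This is an illustration under assumptions, not the PR's code: `Expr`, `Column`, `Literal`, `is_dangling` and `retain_valid_pairs` are hypothetical stand-ins for the physical expression types, and the schema is reduced to a slice of names. Each side of a pair is inspected only at the top level, which is enough if the pairs really are column/literal equi-pairs.

```rust
use std::any::Any;

/// Hypothetical stand-in for a physical expression trait object.
trait Expr {
    fn as_any(&self) -> &dyn Any;
}

struct Column {
    name: String,
}

#[allow(dead_code)] // the literal's value is not read in this sketch
struct Literal {
    value: i64,
}

impl Expr for Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Expr for Literal {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

type Pair = (Box<dyn Expr>, Box<dyn Expr>);

/// True when `expr` is itself a column that is missing from `schema`.
/// Only the top-level node is inspected; nothing is traversed recursively.
fn is_dangling(expr: &dyn Expr, schema: &[&str]) -> bool {
    match expr.as_any().downcast_ref::<Column>() {
        Some(col) => !schema.contains(&col.name.as_str()),
        None => false, // literals and other expressions pass through
    }
}

/// Drops every pair in which either side is a dangling column.
fn retain_valid_pairs(pairs: Vec<Pair>, schema: &[&str]) -> Vec<Pair> {
    let mut kept = Vec::new();
    for (left, right) in pairs {
        if is_dangling(left.as_ref(), schema) || is_dangling(right.as_ref(), schema) {
            continue; // plays the role of the `continue` in the macro
        }
        kept.push((left, right));
    }
    kept
}
```

If a nested expression could ever appear on either side, a dangling column inside it would slip past this check, which is why the guarantee about `collect_columns_from_predicate` matters.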
Yes, exactly! The actual errors were happening due to usage downstream when the filter column (not in the output schema) still existed in the equivalence info and there was an attempt to do something with it. Specifically in the case I saw,
@adriangb Does this look ready to merge?
Yup!
…in `FileScanConfig` (apache#17546)
DF v50 won't work for us without apache/datafusion#17546. We'll need to wait for v50.0.1: apache/datafusion#17594
Which issue does this PR close?
Closes #17511
Rationale for this change
When building equal conditions in a data source node, we want to ignore any stale references to columns that may have been swapped out (e.g. from `try_swapping_with_projection`).
The current code reassigns predicate columns from the filter to refer to the corresponding ones in the updated schema. However, it only ignores non-projected columns. `reassign_predicate_columns` builds an invalid column expression (with index `usize::MAX`) if the column is not projected in the current schema. We don't want to refer to this in the equal conditions we build.
What changes are included in this PR?
Ignores any binary expressions that reference non-existent columns in the current schema (e.g. due to unnecessary projections being removed).
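The rationale and change described above can be sketched as follows. This is a simplified model and an assumption on my part, not the real `reassign_predicate_columns`: the hypothetical `reassign` function only imitates the described behavior of producing index `usize::MAX` for a column missing from the schema, and `equal_conditions` is a made-up helper. For brevity the sketch skips pairs by testing the sentinel index, whereas the macro in the PR looks the column name up in the schema.

```rust
/// Hypothetical column reference: a name plus its index in some schema.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    index: usize,
}

/// Simplified model of column reassignment: re-resolves `name` against
/// `schema`, using `usize::MAX` as the index when the column is absent.
fn reassign(name: &str, schema: &[&str]) -> Column {
    let index = schema
        .iter()
        .position(|f| *f == name)
        .unwrap_or(usize::MAX);
    Column {
        name: name.to_string(),
        index,
    }
}

/// Builds `column = literal` equal conditions, skipping any column that
/// no longer exists in `schema`.
fn equal_conditions(filter_pairs: &[(&str, i64)], schema: &[&str]) -> Vec<(Column, i64)> {
    filter_pairs
        .iter()
        .map(|(name, value)| (reassign(name, schema), *value))
        .filter(|(col, _)| col.index != usize::MAX)
        .collect()
}
```

Without the `filter` step, a condition on the out-of-schema column would carry the sentinel index along, which is the kind of stale reference the PR description says should not end up in the equal conditions.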
Are these changes tested?
Yes, unit test added
Are there any user-facing changes?
N/A