Skip to content

Conversation

rkrishn7
Copy link
Contributor

@rkrishn7 rkrishn7 commented Sep 12, 2025

Which issue does this PR close?

Closes #17511

Rationale for this change

When building equal conditions in a data source node, we want to ignore any stale references to columns that may have been swapped out (e.g. from try_swapping_with_projection).

The current code reassigns predicate columns from the filter to refer to the corresponding ones in the updated schema. However, it only ignores non-projected columns. reassign_predicate_columns builds an invalid column expression (with index usize::MAX) if the column is not projected in the current schema. We don't want to refer to this in the equal conditions we build.

What changes are included in this PR?

Ignores any binary expressions that reference non-existent columns in the current schema (e.g. due to unnecessary projections being removed).

Are these changes tested?

Yes, unit test added

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the datasource Changes to the datasource crate label Sep 12, 2025
@rkrishn7
Copy link
Contributor Author

The fact that reassign_predicate_columns can return invalid column expressions seems like a big footgun. I didn't change within this PR due to usage elsewhere but we might want to think about refactoring it

@rkrishn7
Copy link
Contributor Author

cc @adriangb I think this was inadvertently introduced in #17323

@rkrishn7
Copy link
Contributor Author

Re-ran TPCH benchmark with the same configuration as the referenced issue and all the tests pass now.

Will add a regression test here in a bit!

@rkrishn7 rkrishn7 force-pushed the fix/ignore-irrelevant-equivalence-relations-filescan branch from 9dffcd1 to f6e8894 Compare September 15, 2025 05:37
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the issue which I would like to clarify + document in the PR (maybe in the code) if it is correct is that this applies to a situation like:

SELECT a
FROM t
WHERE b = 2

So it makes sense that we have a filter that references the column b which is not projected. The filter referencing a non-projected column is not an issue, and we can compute equivalence properties for a non-projected column but things downstream (not touched in this PR) will fail / error if they get equivalence properties for a column that is not in the projection because equivalence properties are to communicate information to the parent operator and as far as the parent operator is concerned there is no column b.

Comment on lines +767 to +775
macro_rules! ignore_dangling_col {
($col:expr) => {
if let Some(col) = $col.as_any().downcast_ref::<Column>() {
if schema.index_of(col.name()).is_err() {
continue;
}
}
};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be recursive? I think the answer is no because collect_columns_from_predicate is guaranteed to return a Vec<Column> even if the returned type is Vec<Arc<dyn PhysicalExpr>>. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think by this point the expressions returned by collect_columns_from_predicate should be column/literal equi-pairs since those are what will be successfully pushed down. So yeah don't think there's a need for recursively inspecting the expressions

@adriangb
Copy link
Contributor

cc @adriangb I think this was inadvertently introduced in #17323

Sorry for the delay in reviewing this and sorry to the community if this doesn't make it into 50.0.0 and it ships with this bug. I didn't have a moment to review over the weekend 😞

@rkrishn7
Copy link
Contributor Author

My understanding of the issue which I would like to clarify + document in the PR (maybe in the code) if it is correct is that this applies to a situation like:

SELECT a
FROM t
WHERE b = 2

So it makes sense that we have a filter that references the column b which is not projected. The filter referencing a non-projected column is not an issue, and we can compute equivalence properties for a non-projected column but things downstream (not touched in this PR) will fail / error if they get equivalence properties for a column that is not in the projection because equivalence properties are to communicate information to the parent operator and as far as the parent operator is concerned there is no column b.

Yes, exactly! The actual errors were happening due to usage downstream when the filter column (not in the output schema) still existed in the equivalence info and and there was an attempt to do something with it. Specifically in the case I saw, physical_expr::add_offset_to_expr was failing because the column index was set to usize::MAX (I guess to indicate that it wasn't valid anymore). But downstream wasn't aware so the add was still occurring.

@rkrishn7
Copy link
Contributor Author

cc @adriangb I think this was inadvertently introduced in #17323

Sorry for the delay in reviewing this and sorry to the community if this doesn't make it into 50.0.0 and it ships with this bug. I didn't have a moment to review over the weekend 😞

No worries! 🤝

@rkrishn7
Copy link
Contributor Author

@adriangb Does this look ready to merge?

@adriangb adriangb merged commit 62f214f into apache:main Sep 16, 2025
29 checks passed
@adriangb
Copy link
Contributor

Yup!

@alamb alamb mentioned this pull request Sep 16, 2025
18 tasks
rkrishn7 added a commit to rkrishn7/datafusion that referenced this pull request Sep 16, 2025
alamb pushed a commit that referenced this pull request Sep 16, 2025
bvolpato added a commit to bvolpato/datafusion that referenced this pull request Sep 16, 2025
XiangpengHao added a commit to XiangpengHao/liquid-cache that referenced this pull request Oct 7, 2025
DF v50 won't work for us without
apache/datafusion#17546

We'll need to wait for v50.0.1
apache/datafusion#17594
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TPCH(sf=1) Q10 and Q19 failures when running with datafusion.execution.parquet.pushdown_filters
2 participants