fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` #17546

rkrishn7 · 2025-09-12T20:25:36Z

Which issue does this PR close?

Rationale for this change

When building equal conditions in a data source node, we want to ignore any stale references to columns that may have been swapped out (e.g. from try_swapping_with_projection).

The current code reassigns predicate columns from the filter to refer to the corresponding ones in the updated schema. However, it only ignores non-projected columns. reassign_predicate_columns builds an invalid column expression (with index usize::MAX) if the column is not projected in the current schema. We don't want to refer to this in the equal conditions we build.

What changes are included in this PR?

Ignores any binary expressions that reference non-existent columns in the current schema (e.g. due to unnecessary projections being removed).

Are these changes tested?

Yes, unit test added

Are there any user-facing changes?

N/A

…in FileScanConfig

rkrishn7 · 2025-09-12T20:27:22Z

The fact that reassign_predicate_columns can return invalid column expressions seems like a big footgun. I didn't change within this PR due to usage elsewhere but we might want to think about refactoring it

rkrishn7 · 2025-09-12T20:40:42Z

cc @adriangb I think this was inadvertently introduced in #17323

rkrishn7 · 2025-09-12T20:42:27Z

Re-ran TPCH benchmark with the same configuration as the referenced issue and all the tests pass now.

Will add a regression test here in a bit!

adriangb

My understanding of the issue which I would like to clarify + document in the PR (maybe in the code) if it is correct is that this applies to a situation like:

SELECT a
FROM t
WHERE b = 2

So it makes sense that we have a filter that references the column b which is not projected. The filter referencing a non-projected column is not an issue, and we can compute equivalence properties for a non-projected column but things downstream (not touched in this PR) will fail / error if they get equivalence properties for a column that is not in the projection because equivalence properties are to communicate information to the parent operator and as far as the parent operator is concerned there is no column b.

adriangb · 2025-09-15T15:06:42Z

datafusion/datasource/src/file_scan_config.rs

+        macro_rules! ignore_dangling_col {
+            ($col:expr) => {
+                if let Some(col) = $col.as_any().downcast_ref::<Column>() {
+                    if schema.index_of(col.name()).is_err() {
+                        continue;
+                    }
+                }
+            };
+        }


Does this need to be recursive? I think the answer is no because collect_columns_from_predicate is guaranteed to return a Vec<Column> even if the returned type is Vec<Arc<dyn PhysicalExpr>>. Is that right?

I think by this point the expressions returned by collect_columns_from_predicate should be column/literal equi-pairs since those are what will be successfully pushed down. So yeah don't think there's a need for recursively inspecting the expressions

adriangb · 2025-09-15T15:11:45Z

cc @adriangb I think this was inadvertently introduced in #17323

Sorry for the delay in reviewing this and sorry to the community if this doesn't make it into 50.0.0 and it ships with this bug. I didn't have a moment to review over the weekend 😞

rkrishn7 · 2025-09-15T17:35:48Z

My understanding of the issue which I would like to clarify + document in the PR (maybe in the code) if it is correct is that this applies to a situation like:
SELECT a
FROM t
WHERE b = 2
So it makes sense that we have a filter that references the column b which is not projected. The filter referencing a non-projected column is not an issue, and we can compute equivalence properties for a non-projected column but things downstream (not touched in this PR) will fail / error if they get equivalence properties for a column that is not in the projection because equivalence properties are to communicate information to the parent operator and as far as the parent operator is concerned there is no column b.

Yes, exactly! The actual errors were happening due to usage downstream when the filter column (not in the output schema) still existed in the equivalence info and and there was an attempt to do something with it. Specifically in the case I saw, physical_expr::add_offset_to_expr was failing because the column index was set to usize::MAX (I guess to indicate that it wasn't valid anymore). But downstream wasn't aware so the add was still occurring.

rkrishn7 · 2025-09-15T17:36:21Z

cc @adriangb I think this was inadvertently introduced in #17323

Sorry for the delay in reviewing this and sorry to the community if this doesn't make it into 50.0.0 and it ships with this bug. I didn't have a moment to review over the weekend 😞

No worries! 🤝

rkrishn7 · 2025-09-16T03:04:48Z

@adriangb Does this look ready to merge?

adriangb · 2025-09-16T03:10:59Z

Yup!

…in `FileScanConfig` (apache#17546)

…in `FileScanConfig` (#17546) (#17600)

…in `FileScanConfig` (apache#17546)

DF v50 won't work for us without apache/datafusion#17546 We'll need to wait for v50.0.1 apache/datafusion#17594

fix: ignore non-existent columns when adding filter equivalence info …

f27df43

…in FileScanConfig

github-actions bot added the datasource Changes to the datasource crate label Sep 12, 2025

add test

f6e8894

rkrishn7 force-pushed the fix/ignore-irrelevant-equivalence-relations-filescan branch from 9dffcd1 to f6e8894 Compare September 15, 2025 05:37

rkrishn7 mentioned this pull request Sep 15, 2025

Release DataFusion 50.0.0 (Aug/Sep 2025) #16799

Closed

45 tasks

adriangb approved these changes Sep 15, 2025

View reviewed changes

adriangb merged commit 62f214f into apache:main Sep 16, 2025
29 checks passed

rkrishn7 mentioned this pull request Sep 16, 2025

Refactor usage of reassign_predicate_columns #17581

Closed

alamb mentioned this pull request Sep 16, 2025

Release DataFusion 50.1.0 (minor) #17594

Closed

18 tasks

rkrishn7 added a commit to rkrishn7/datafusion that referenced this pull request Sep 16, 2025

fix: ignore non-existent columns when adding filter equivalence info …

4f46d96

…in `FileScanConfig` (apache#17546)

alamb mentioned this pull request Sep 16, 2025

[branch-50] fix: ignore non-existent columns when adding filter equivalence info in FileScanConfig (#17546) #17600

Merged

alamb pushed a commit that referenced this pull request Sep 16, 2025

fix: ignore non-existent columns when adding filter equivalence info …

891202a

…in `FileScanConfig` (#17546) (#17600)

bvolpato added a commit to bvolpato/datafusion that referenced this pull request Sep 16, 2025

fix: ignore non-existent columns when adding filter equivalence info …

e03c3d5

…in `FileScanConfig` (apache#17546)

XiangpengHao mentioned this pull request Sep 17, 2025

Bump to DataFusion v50.0.1 XiangpengHao/liquid-cache#370

Merged

XiangpengHao added a commit to XiangpengHao/liquid-cache that referenced this pull request Oct 7, 2025

Bump to DataFusion v50.0.1 (#370)

f29fca5

DF v50 won't work for us without apache/datafusion#17546 We'll need to wait for v50.0.1 apache/datafusion#17594

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` #17546

fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` #17546

Uh oh!

rkrishn7 commented Sep 12, 2025 •

edited

Loading

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

adriangb left a comment

Uh oh!

adriangb Sep 15, 2025

Uh oh!

rkrishn7 Sep 15, 2025

Uh oh!

adriangb commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 16, 2025

Uh oh!

Uh oh!

adriangb commented Sep 16, 2025

Uh oh!

Uh oh!

fix: ignore non-existent columns when adding filter equivalence info in FileScanConfig #17546

fix: ignore non-existent columns when adding filter equivalence info in FileScanConfig #17546

Uh oh!

Conversation

rkrishn7 commented Sep 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

rkrishn7 commented Sep 12, 2025

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

rkrishn7 Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 15, 2025

Uh oh!

rkrishn7 commented Sep 16, 2025

Uh oh!

Uh oh!

adriangb commented Sep 16, 2025

Uh oh!

Uh oh!

fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` #17546

fix: ignore non-existent columns when adding filter equivalence info in `FileScanConfig` #17546

rkrishn7 commented Sep 12, 2025 •

edited

Loading