Fix propagation of optimized predicates on nested projections #3228

Merged (3 commits) on Aug 29, 2022

Conversation

isidentical
Contributor

Which issue does this PR close?

Closes #3073.

Rationale for this change

This PR prevents the removal of predicates that reference no columns (WHERE FALSE, WHERE 1=1, etc.). Such predicates can be created during the column-name replacement phase of the filter pushdown optimizer, specifically when it handles projections.

What changes are included in this PR?

Columnless predicates are now collected into a separate list, stripped from the actual list of filters while the projection and filter are swapped, and then re-applied.
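
As a rough illustration of the collect, strip, and re-apply idea described above, here is a minimal sketch. It does not use DataFusion's types: `Predicate`, `split_columnless`, and `push_through_projection` are hypothetical names, and a predicate is modelled as a display string paired with the set of columns it references.

```rust
use std::collections::HashSet;

/// A predicate paired with the set of columns it references.
/// Hypothetical stand-in for the optimizer's bookkeeping, not DataFusion's type.
type Predicate = (String, HashSet<String>);

/// Split predicates into (column-less, column-referencing).
fn split_columnless(predicates: Vec<Predicate>) -> (Vec<Predicate>, Vec<Predicate>) {
    predicates
        .into_iter()
        .partition(|(_, cols)| cols.is_empty())
}

/// Strip the column-less predicates, process the rest, then re-apply them.
fn push_through_projection(predicates: Vec<Predicate>) -> Vec<Predicate> {
    let (columnless, mut with_columns) = split_columnless(predicates);
    // The projection/filter switch would operate on `with_columns` only;
    // that step is omitted in this sketch.
    with_columns.extend(columnless);
    with_columns
}
```

The point of the sketch is only that a predicate such as `1 = 2` survives the round trip instead of being dropped because it matches no column.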

Are there any user-facing changes?

This should fix the bug reported in #3073.

@github-actions github-actions bot added the optimizer Optimizer rules label Aug 22, 2022
@isidentical isidentical marked this pull request as ready for review August 23, 2022 10:59
Member

@andygrove andygrove left a comment


The changes look reasonable to me, but I am not very familiar with some of this code so will need to look closer. I will make time in the next day or two.

@Dandandan
Contributor

I'm not sure if this fixes the bug in issue #3073. I'm not totally convinced we should not propagate filters without columns (e.g. constants); the result should remain the same whether they are propagated or not. Issue #3073 seems to be about a filter expression with a column, that somehow doesn't filter the row out.

Could we add a test for #3073 here?

Contributor

@Dandandan Dandandan left a comment


Add test for #3037 and explain why changes are needed

@isidentical
Contributor Author

> Issue #3073 seems to be about a filter expression with a column, that somehow doesn't filter the row out.

@Dandandan since the example in #3073 uses an in-memory table, the filter where f.a=2 got replaced with where 1=2 in a step of the optimizer (which, at least for me, makes perfect sense) when we are rewriting columns here:
https://github.com/apache/arrow-datafusion/blob/873b071dff1a6099d30abdd24437e083a60e2686/datafusion/optimizer/src/filter_push_down.rs#L396
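
To make the rewrite described above concrete, here is a minimal sketch of how replacing a column with the expression a projection defines for it can leave a predicate with no columns at all. The `Expr` enum and both helper functions are toy definitions, not DataFusion's API, and the projection `1 AS a` is an assumed stand-in for the query in #3073.

```rust
use std::collections::{HashMap, HashSet};

/// Toy expression type (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
}

/// Replace each column with the expression the projection defines for it.
fn replace_columns(expr: &Expr, projection: &HashMap<String, Expr>) -> Expr {
    match expr {
        Expr::Column(name) => projection
            .get(name)
            .cloned()
            .unwrap_or_else(|| expr.clone()),
        Expr::Literal(_) => expr.clone(),
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(replace_columns(l, projection)),
            Box::new(replace_columns(r, projection)),
        ),
    }
}

/// Collect the names of all columns referenced by `expr`.
fn columns_of(expr: &Expr, out: &mut HashSet<String>) {
    match expr {
        Expr::Column(name) => {
            out.insert(name.clone());
        }
        Expr::Literal(_) => {}
        Expr::Eq(l, r) => {
            columns_of(l, out);
            columns_of(r, out);
        }
    }
}

/// Assumed scenario: the projection defines `1 AS a`, the filter is `a = 2`.
fn rewritten_filter() -> Expr {
    let mut projection = HashMap::new();
    projection.insert("a".to_string(), Expr::Literal(1));
    let filter = Expr::Eq(
        Box::new(Expr::Column("a".to_string())),
        Box::new(Expr::Literal(2)),
    );
    replace_columns(&filter, &projection)
}
```

In this sketch `rewritten_filter()` yields `1 = 2`, and `columns_of` finds no columns in it, so any logic that selects filters by the columns they reference would skip it.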

> I'm not sure if this fixes the bug in issue #3073. I'm not totally convinced we should not propagate filters without columns (e.g. constants); the result should remain the same whether they are propagated or not.

An earlier step of the filter pushdown optimizer already does similar work: it ensures that WHERE 1=2 (and other variants of constant/column-less filters) are not propagated, since, unlike regular filters, we can't rewrite them if there are no earlier projections to propagate to. The main thing this PR does is allow this check to also be carried out in the Projection part, where filters might change due to how we propagate constants.
https://github.com/apache/arrow-datafusion/blob/873b071dff1a6099d30abdd24437e083a60e2686/datafusion/optimizer/src/filter_push_down.rs#L358-L368

@isidentical
Contributor Author

> Add test for #3037 and explain why changes are needed

Definitely, will be working on that!

@github-actions github-actions bot added the core Core DataFusion crate label Aug 27, 2022
@isidentical isidentical requested a review from Dandandan August 27, 2022 08:05
@Dandandan
Contributor

Isn't the issue then that the propagated filters without columns are not added to the plan at all, even when we are at the bottom of a plan? E.g. a propagated WHERE FALSE should still be present regardless of whether it has columns in it.

@isidentical
Contributor Author

> Isn't the issue then that the propagated filters without columns are not added to the plan at all, even when we are at the bottom of a plan? E.g. a propagated WHERE FALSE should still be present regardless of whether it has columns in it.

Exactly, at least that is my understanding of this issue. I thought that, as is, it would be similar to the existing behaviour from #225, but this time done at the projection level rather than the filter level.

If it makes sense, I can also change the logic in issue_filters so that when pushing down we can issue filters for both used filters + all the column-less filters.
https://github.com/apache/arrow-datafusion/blob/873b071dff1a6099d30abdd24437e083a60e2686/datafusion/optimizer/src/filter_push_down.rs#L125-L145
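
The proposal above could look roughly like the following sketch. It is not the actual `issue_filters` code: `Predicate` and `filters_to_issue` are hypothetical names, and it assumes that a filter counts as "used" when it references at least one of the columns in `used_columns`. Under that assumption a column-less filter can never match, so the sketch selects it explicitly.

```rust
use std::collections::HashSet;

/// A predicate paired with the set of columns it references.
/// Hypothetical stand-in, not DataFusion's type.
type Predicate = (String, HashSet<String>);

/// Select the filters to issue at the current plan node: every filter that
/// touches one of `used_columns`, plus every filter that has no columns.
fn filters_to_issue<'a>(
    filters: &'a [Predicate],
    used_columns: &HashSet<String>,
) -> Vec<&'a Predicate> {
    filters
        .iter()
        .filter(|(_, cols)| cols.is_empty() || !cols.is_disjoint(used_columns))
        .collect()
}
```

With filters `a = 2`, `b = 3`, and `1 = 2` and `used_columns = {a}`, this selects `a = 2` and `1 = 2`. Without the `cols.is_empty()` arm, `1 = 2` would be silently left out of the plan.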

@isidentical
Contributor Author

@Dandandan I did a re-implementation using the approach I described above (handling this at the issue_filters level rather than in individual plans) in here. I'd be happy to include it here if it makes more sense.

@Dandandan
Contributor

Yes, this makes the most sense, and simplifies the implementation quite a bit.

@codecov-commenter

Codecov Report

Merging #3228 (ad0a93d) into master (873b071) will increase coverage by 0.01%.
The diff coverage is 97.50%.

@@            Coverage Diff             @@
##           master    #3228      +/-   ##
==========================================
+ Coverage   85.91%   85.93%   +0.01%     
==========================================
  Files         294      294              
  Lines       53443    53469      +26     
==========================================
+ Hits        45918    45946      +28     
+ Misses       7525     7523       -2     
| Impacted Files | Coverage Δ |
|---|---|
| datafusion/optimizer/src/filter_push_down.rs | 98.36% <94.73%> (+0.13%) ⬆️ |
| datafusion/core/tests/sql/projection.rs | 97.38% <100.00%> (+0.41%) ⬆️ |
| datafusion/expr/src/logical_plan/plan.rs | 78.55% <0.00%> (+0.17%) ⬆️ |


Contributor

@Dandandan Dandandan left a comment


LGTM, thanks @isidentical !

@Dandandan Dandandan merged commit 7aed4d6 into apache:master Aug 29, 2022
@Dandandan
Contributor

Thanks again @isidentical

@ursabot

ursabot commented Aug 29, 2022

Benchmark runs are scheduled for baseline = 873b071 and contender = 7aed4d6. 7aed4d6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Contributor

@alamb alamb left a comment


This looks like a great change -- I think it is a very positive sign when we fix bugs by deleting code 👍

Thanks @isidentical and @Dandandan

Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

incorrect where clause comparison while using table alias
6 participants