Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable parquet filter pushdown by default #3463

Open
5 of 6 tasks
Tracked by #3462
alamb opened this issue Sep 13, 2022 · 17 comments · May be fixed by #12524
Open
5 of 6 tasks
Tracked by #3462

Enable parquet filter pushdown by default #3463

alamb opened this issue Sep 13, 2022 · 17 comments · May be fixed by #12524

Comments

@alamb
Copy link
Contributor

alamb commented Sep 13, 2022

In #3380 @thinkharderdev added support for evaluating filters during the parquet scan via the RowIndex mechanism 🎉

This feature is currently enabled via a feature flag, which is disabled by default.

This ticket tracks enabling this feature by default.

Currently known items are:

@alamb alamb self-assigned this Oct 13, 2022
@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2022

I plan to review the parquet test coverage over the next day or two

My basic plan is to:

  1. Make a draft PR that enables pushdown by default and see if any regression tests fail
  2. Do some manual testing using datafusion-cli and some internal datasets

@alamb
Copy link
Contributor Author

alamb commented Oct 13, 2022

The PR that enables the feature is #3828 and it looks quite promising

Also, I have a proposed config change that I would like to get in, #3822, which would allow users to quickly disable this feature if they found issues

@alamb
Copy link
Contributor Author

alamb commented Oct 26, 2022

Update here: I am working on some fuzz testing for the predicate pushdown code

@alamb
Copy link
Contributor Author

alamb commented Oct 28, 2022

I found a bug in pushdown by writing a test #4005

@alamb
Copy link
Contributor Author

alamb commented Oct 28, 2022

I found another one with my test. I will keep the list on this ticket updated

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2022

I think we are getting close

@alamb
Copy link
Contributor Author

alamb commented Nov 2, 2022

Update here is I think once we get #3976 merged I'll put up the PR to enable the feature by default

@Dandandan
Copy link
Contributor

@alamb What's the remaining work that needs to be done to enable it by default? Anything I can do to get this over the finish line?

@alamb
Copy link
Contributor Author

alamb commented Feb 5, 2023

@Dandandan ❤️ thank you

I know of two major items:

  1. Enable page index by default Enable parquet page level skipping (page index pruning) by default #4085
  2. Ensure that enabling filter pushdown by default doesn't cause performance regressions in cases where the filter isn't very selective. I know @tustvold had some thoughts in this area.

For item 1, I tried a few times, but am currently blocked by #5104 and haven't had time to return to it

For item 2, I think someone needs to look at some benchmark results (perhaps TPCH) and figure out what we can do to avoid the regression (or measure and determine it isn't significant)

I keep hoping to have time to work on item 1, as I am pretty sure I know what the problem is, but other things keep coming up :(

I haven't had a chance to work on 2 yet

@alamb
Copy link
Contributor Author

alamb commented Mar 2, 2023

Update here is I am working on a larger benchmarking story, part of which would give us more confidence to merge changes like this in. I hope to have that done early next week

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2023

With the introduction of bench.sh with tpch and clickbench (#6994) I think we are now in a better position to do this experiment again and fix any performance regressions

@tustvold
Copy link
Contributor

apache/arrow-rs#5523 might help mitigate the impact of pushing down predicates that turn out to not be very selective

@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2024

It would be really nice to (finally) be able to turn this optimization on -- thank you @tustvold

cc @Dandandan

@alamb alamb removed their assignment Aug 5, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 18, 2024

I tried again now that #4028 is done in #12524. Sadly many of the queries still slow down. I will try and find time to figure out why

@Dandandan
Copy link
Contributor

@alamb I wonder if you already have some further insights?

@Dandandan
Copy link
Contributor

I am just reading the latest messages on apache/arrow-rs#5523 , thanks for the updates

@alamb
Copy link
Contributor Author

alamb commented Oct 22, 2024

The writeup here from @tustvold is quite good: apache/arrow-rs#6454 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants