Spark 4.0: Support Spark Partial Limit Push Down #13451
Conversation
@xiaoxuandev If the changes are the same, let's target Spark 4.0 first and backport to 3.5 later (what about 3.4?).
Force-pushed from 0f3b68b to 8e970b2.
@manuzhang That makes sense. I've updated the PR to target 4.0 only. We could backport to 3.4 as well.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Hi @amogh-jahagirdar, would you be able to help take a look at this? Thanks!
This PR implements limit pushdown optimization for Iceberg on Spark 4.0, enabling early termination during scan task planning to improve performance for LIMIT queries.

Resolves: #13383

Notes
When Spark pushes down a LIMIT, it ensures that no additional filters or expressions are present, so this implementation:
- Leverages Spark's native partial limit pushdown when available (e.g., SELECT * FROM table LIMIT n, or queries with partition pruning); see the first sketch below.
- Implements Iceberg-level early termination during task group planning once the required number of records is reached; see the second sketch below.
- Disables limit push down when preserve-data-grouping is enabled.
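For context, here is a minimal sketch of the Spark side, assuming a simplified scan builder. Spark's DSv2 SupportsPushDownLimit interface is real, but the class name, the preserveDataGrouping flag, and the build() stub are illustrative, not the PR's actual code:

```java
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsPushDownLimit;

// Sketch: a DSv2 scan builder that accepts Spark's partial limit pushdown.
class LimitAwareScanBuilder implements SupportsPushDownLimit {
  private final boolean preserveDataGrouping; // e.g., read from SQL conf
  private Integer pushedLimit = null;

  LimitAwareScanBuilder(boolean preserveDataGrouping) {
    this.preserveDataGrouping = preserveDataGrouping;
  }

  @Override
  public boolean pushLimit(int limit) {
    // Spark only offers a limit when no other filters or expressions remain,
    // so accepting it here is safe. Decline when data grouping must be
    // preserved, since early termination could skip whole groups.
    if (preserveDataGrouping) {
      return false;
    }
    this.pushedLimit = limit;
    return true;
  }

  @Override
  public boolean isPartiallyPushed() {
    // "Partial" pushdown: the source may return more rows than the limit;
    // Spark still applies a final LIMIT on top.
    return true;
  }

  @Override
  public Scan build() {
    // A real builder would construct a Scan whose task planning honors
    // pushedLimit (see the next sketch).
    throw new UnsupportedOperationException("sketch only");
  }
}
```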
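And a sketch, under the same caveat, of the Iceberg-side early termination during task planning; FileScanTask and recordCount() here are hypothetical stand-ins for whatever the planner exposes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: stop collecting scan tasks once their record counts cover the limit.
final class LimitedPlanning {
  /** Hypothetical task abstraction exposing a per-file record count. */
  interface FileScanTask {
    long recordCount();
  }

  static List<FileScanTask> planWithLimit(Iterable<FileScanTask> tasks, long limit) {
    List<FileScanTask> selected = new ArrayList<>();
    long accumulated = 0L;
    for (FileScanTask task : tasks) {
      selected.add(task);
      accumulated += task.recordCount();
      if (accumulated >= limit) {
        break; // enough records planned; remaining data files are never opened
      }
    }
    return selected;
  }
}
```

Because the pushdown is partial, over-selecting is harmless: Spark re-applies the LIMIT on top of whatever the scan returns.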
Testing

Benchmark Results

(These results are illustrative; tables with a large number of data files generally lead to longer execution times when limit push down is disabled.)
Benchmark scenarios:
- 1 row per data file
- 5000 rows per data file
- 20000 rows per data file
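A rough way to reproduce the shape of these measurements (the table name and session setup below are assumptions, not from the PR):

```java
import org.apache.spark.sql.SparkSession;

// Times a simple LIMIT query; run against tables laid out with 1, 5000,
// and 20000 rows per data file to compare planning behavior.
public class LimitBenchmark {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    long start = System.nanoTime();
    spark.sql("SELECT * FROM local.db.events LIMIT 10").collectAsList();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("LIMIT 10 took " + elapsedMs + " ms");
  }
}
```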