
Conversation

@xiaoxuandev (Contributor) commented Jul 2, 2025

This PR implements limit pushdown optimization for Iceberg on Spark 4.0, enabling early termination during scan task planning to improve performance for LIMIT queries. Resolves: #13383

Notes

When Spark pushes down a LIMIT, it ensures that no additional filters or expressions are present, so this implementation:

  1. Leverages Spark's native partial limit pushdown when available
    (e.g., SELECT * FROM table LIMIT n or queries with partition pruning)

  2. Implements Iceberg-level early termination during task group planning once the required number of records is reached.

  3. Disables limit pushdown when preserve-data-grouping is enabled.
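The early termination in note 2 can be sketched roughly as follows. This is a minimal illustration, not the actual change in this PR: `planWithLimit` and the per-file record counts are hypothetical stand-ins for Iceberg's scan task group planning.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitPushdownSketch {

  // Given the record count of each candidate data file (in planning order),
  // keep adding files to the scan until the accumulated record count can
  // satisfy the LIMIT, then stop planning further task groups.
  public static List<Long> planWithLimit(List<Long> fileRecordCounts, long limit) {
    List<Long> selected = new ArrayList<>();
    long accumulated = 0;
    for (long count : fileRecordCounts) {
      selected.add(count);
      accumulated += count;
      if (accumulated >= limit) {
        break; // early termination: enough records to satisfy the LIMIT
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<Long> files = List.of(5000L, 5000L, 5000L, 5000L);
    // LIMIT 100 needs only the first file; LIMIT 12000 needs three files.
    System.out.println(planWithLimit(files, 100).size());
    System.out.println(planWithLimit(files, 12000).size());
  }
}
```

Counting raw file record counts is only safe here because, as noted above, Spark pushes down a LIMIT only when no additional filters or expressions are present; otherwise a file's record count would overestimate the rows it contributes.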

Testing

  • Unit Tests
  • Performance Benchmarks

Benchmark Results

(These results are illustrative; tables with a large number of data files generally lead to longer execution times when limit pushdown is disabled.)

1 row per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.093 s | 37.96 s | 99.75% faster |
| Limit Query 1000 | 0.484 s | 41.04 s | 98.82% faster |
| Limit Query 10000 | 7.023 s | 38.99 s | 81.99% faster |
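The Improvement column appears to be the relative speedup against the disabled baseline, i.e. (disabled − enabled) / disabled. A quick check against the first table's numbers (the `improvement` helper is just for illustration):

```java
public class Improvement {

  // Relative speedup (in percent) of the enabled run against the
  // disabled baseline: (disabled - enabled) / disabled * 100.
  public static double improvement(double enabledSec, double disabledSec) {
    return (disabledSec - enabledSec) / disabledSec * 100.0;
  }

  public static void main(String[] args) {
    // Matches the first table within rounding: ~99.75, ~98.82, ~81.99
    System.out.printf("%.2f %.2f %.2f%n",
        improvement(0.093, 37.96),
        improvement(0.484, 41.04),
        improvement(7.023, 38.99));
  }
}
```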

5000 rows per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.0163 s | 0.0488 s | 66.5% faster |
| Limit Query 1000 | 0.0170 s | 0.0499 s | 66.0% faster |
| Limit Query 10000 | 0.0177 s | 0.0632 s | 71.9% faster |

20000 rows per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.0416 s | 0.0529 s | 21.4% faster |
| Limit Query 1000 | 0.0421 s | 0.0524 s | 19.7% faster |
| Limit Query 10000 | 0.0422 s | 0.0576 s | 26.7% faster |

@manuzhang (Member) commented:

@xiaoxuandev If the changes are the same, let's target Spark 4.0 first and backport to 3.5 later (what about 3.4?).

@xiaoxuandev force-pushed the support-spark-limit-pushdown-4.0 branch from 0f3b68b to 8e970b2 on July 3, 2025 16:10
@xiaoxuandev xiaoxuandev changed the title Spark 3.5, 4.0: Support Spark Partial Limit Push Down Spark 4.0: Support Spark Partial Limit Push Down Jul 3, 2025
@xiaoxuandev (Contributor, Author) commented:

@manuzhang That makes sense. I’ve updated the PR to target 4.0 only. We could backport to 3.4 as well.

@github-actions (bot) commented:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 12, 2025
@xiaoxuandev (Contributor, Author) commented:

Hi @amogh-jahagirdar, would you be able to help take a look at this? Thanks!

@github-actions github-actions bot removed the stale label Sep 13, 2025
@github-actions github-actions bot added the stale label Oct 13, 2025
@huaxingao huaxingao removed the stale label Oct 13, 2025
Linked issue: Iceberg BatchScan & SparkDistributedDataScan to support limit pushdown