
Conversation

@xiaoxuandev (Contributor) commented Jul 2, 2025

This PR implements limit pushdown optimization for Iceberg on Spark 4.0, enabling early termination during scan task planning to improve performance for LIMIT queries. Resolves: #13383

Notes

When Spark pushes down a LIMIT, it ensures that no additional filters or expressions are present, so this implementation:

  1. Leverages Spark's native partial limit pushdown when available
    (e.g., SELECT * FROM table LIMIT n or queries with partition pruning)

  2. Implements Iceberg-level early termination during task group planning once the required number of records is reached.

  3. Disables limit pushdown when preserve-data-grouping is enabled.
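The early termination in note 2 can be sketched roughly as follows. This is a minimal illustration, not the actual change in this PR: `planWithLimit` and the per-file record counts are hypothetical stand-ins for Iceberg's scan task group planning.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitPushdownSketch {

  // Given the record count of each candidate data file (in planning order),
  // keep adding files to the scan until the accumulated record count can
  // satisfy the LIMIT, then stop planning further task groups.
  public static List<Long> planWithLimit(List<Long> fileRecordCounts, long limit) {
    List<Long> selected = new ArrayList<>();
    long accumulated = 0;
    for (long count : fileRecordCounts) {
      selected.add(count);
      accumulated += count;
      if (accumulated >= limit) {
        break; // early termination: enough records to satisfy the LIMIT
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<Long> files = List.of(5000L, 5000L, 5000L, 5000L);
    // LIMIT 100 needs only the first file; LIMIT 12000 needs three files.
    System.out.println(planWithLimit(files, 100).size());
    System.out.println(planWithLimit(files, 12000).size());
  }
}
```

Counting raw file record counts is only safe here because, as noted above, Spark pushes down a LIMIT only when no additional filters or expressions are present; otherwise a file's record count would overestimate the rows it contributes.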

Testing

  • Unit Tests
  • Performance Benchmarks

Benchmark Results

(These results are illustrative; tables with a large number of data files generally lead to longer execution times when limit pushdown is disabled.)

1 row per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.093 s | 37.96 s | 99.75% faster |
| Limit Query 1000 | 0.484 s | 41.04 s | 98.82% faster |
| Limit Query 10000 | 7.023 s | 38.99 s | 81.99% faster |
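The Improvement column appears to be the relative speedup against the disabled baseline, i.e. (disabled − enabled) / disabled. A quick check against the first table's numbers (the `improvement` helper is just for illustration):

```java
public class Improvement {

  // Relative speedup (in percent) of the enabled run against the
  // disabled baseline: (disabled - enabled) / disabled * 100.
  public static double improvement(double enabledSec, double disabledSec) {
    return (disabledSec - enabledSec) / disabledSec * 100.0;
  }

  public static void main(String[] args) {
    // Matches the first table within rounding: ~99.75, ~98.82, ~81.99
    System.out.printf("%.2f %.2f %.2f%n",
        improvement(0.093, 37.96),
        improvement(0.484, 41.04),
        improvement(7.023, 38.99));
  }
}
```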

5000 rows per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.0163 s | 0.0488 s | 66.5% faster |
| Limit Query 1000 | 0.0170 s | 0.0499 s | 66.0% faster |
| Limit Query 10000 | 0.0177 s | 0.0632 s | 71.9% faster |

20000 rows per data file

| Query Type | Push Down Enabled | Push Down Disabled | Improvement |
| --- | --- | --- | --- |
| Limit Query 100 | 0.0416 s | 0.0529 s | 21.4% faster |
| Limit Query 1000 | 0.0421 s | 0.0524 s | 19.7% faster |
| Limit Query 10000 | 0.0422 s | 0.0576 s | 26.7% faster |

@manuzhang (Member) commented:

@xiaoxuandev If the changes are the same, let's target Spark 4.0 first and backport to 3.5 later (what about 3.4?).

@xiaoxuandev force-pushed the support-spark-limit-pushdown-4.0 branch from 0f3b68b to 8e970b2 on July 3, 2025 16:10
@xiaoxuandev xiaoxuandev changed the title Spark 3.5, 4.0: Support Spark Partial Limit Push Down Spark 4.0: Support Spark Partial Limit Push Down Jul 3, 2025
@xiaoxuandev (Contributor, Author) commented:

@manuzhang That makes sense. I’ve updated the PR to target 4.0 only. We could backport to 3.4 as well.

@github-actions (bot) commented:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 12, 2025
@xiaoxuandev (Contributor, Author) commented:

Hi @amogh-jahagirdar, would you be able to help take a look at this? Thanks!

@github-actions github-actions bot removed the stale label Sep 13, 2025
@github-actions github-actions bot added the stale label Oct 13, 2025
@huaxingao huaxingao removed the stale label Oct 13, 2025
Linked issue: Iceberg BatchScan & SparkDistributedDataScan to support limit pushdown