API, Core: Scan API for partition stats by gaborkaszab · Pull Request #14640 · apache/iceberg

gaborkaszab · 2025-11-20T09:10:20Z

Background: The current way to query partition stats is through PartitionStatsHandler.readPartitionStatsFile(). For this, the user has to put together the schema and get the input file to read. A more convenient API for scanning partition stats, similar to the other scan APIs, would be easier to use. It could also offer filter and projection capabilities for better read performance.

Context: #14508 contains the changes required to introduce the new API covering the read functionalities existing today. To reduce the scope and the size of the review it's split into multiple steps and this one is the first with the following scope:

Introducing a new API to scan partition statistics
Providing an implementation for this new API
Deprecating the old way of querying partition stats
However, usage of the deprecated functionality is not replaced with the new API. That's a follow-up step.

For more details see #14508

gaborkaszab · 2025-11-20T09:25:03Z

Note: this is split from the PoC PR and concentrates on introducing, implementing, and testing the new API. In #14508 you can see what it would look like to replace the usage of the deprecated functionality.

gaborkaszab · 2025-11-20T11:03:28Z

Hey @nastra ,
You were busy working on stats recently; do you think you could take a look at this? The original PR was #14508, which contains more details and also replaces the deprecated functionality with the new one. To keep the PR small, I split the introduction of the new functionality into this separate PR.

api/src/main/java/org/apache/iceberg/Table.java

gaborkaszab · 2025-12-01T09:20:50Z

Hey @nastra @ajantha-bhat @pvary ,
I'm pinging you because you were all involved with stats to some extent. Would you mind taking a look at this? Note, I have a PoC PR that contains the full picture, also replacing old usage and all. This is a split from that, introducing the new API and its implementation, but it doesn't replace usage of the old functionality.

gaborkaszab · 2025-12-09T13:52:50Z

@nastra @ajantha-bhat Do you think you can take a look at this? You were involved with stats stuff recently.

pvary

Looks good to me.
Let's wait a bit for others to chime in if they want. I will merge if there are no new comments.

nastra · 2025-12-12T12:09:07Z

sorry I've been busy with internal things but I'll try to review this next week

core/src/test/java/org/apache/iceberg/PartitionStatisticsTestBase.java

core/src/test/java/org/apache/iceberg/TestBasePartitionStatisticsScan.java

findinpath · 2025-12-15T11:05:55Z

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

+            .filter(f -> f.snapshotId() == snapshotId)
+            .findFirst();
+
+    if (statsFile.isEmpty()) {


If the snapshot is missing partition stats, a previous snapshot may have them.

Any partition stats may be better than no partition stats.

Thanks for taking a look, @findinpath !
I think in general the "get me the latest available stats" use-case makes sense. However, it would introduce a lot of ambiguity into the API because we wouldn't know how fresh the returned stats are. I think we have two options here:

1. Introduce a way to provide a snapshot ID (other than useSnapshot()) that is explicitly used for these latest-available searches. I think such functionality should somehow report how many steps were required to find the stats, and maybe some metrics about the snapshots in the chain that don't have partition stats. E.g., 3 steps were needed to find partition stats, and the skipped snapshots added 12 data files. The engine can then judge whether using such stats makes sense or not.

2. Let the engine do the work of finding the snapshot with partition stats. This way the engine can judge directly whether it makes sense to use such stats (e.g., it already skipped half of the snapshots, so it won't continue following the chain). I think this is the cleaner approach at the moment. For this, PartitionStatsHandler.latestStatsFile() could serve as an example of how to find the latest snapshot having partition stats.

I'll go with the second option for now. If we later identify a common or generic need to fetch statistics that are "around" a specific snapshot, we can introduce a new API method or add a utility function in the Iceberg core.
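To make the second option concrete, here is a minimal, self-contained sketch of what the engine-side search could look like. It does not use Iceberg classes: the snapshot chain is modeled as a plain child-to-parent ID map and stats availability as a set of snapshot IDs. The class name, method names, and the `maxSteps` budget are all hypothetical and only illustrate the idea. `exactMatch` mirrors the behavior in the diff above (only the requested snapshot counts), while `nearestAncestorWithStats` is the kind of bounded walk an engine might do on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;
import java.util.Set;

// Hypothetical stand-in, NOT Iceberg API: snapshots are modeled as a
// child -> parent ID map, and stats availability as a set of snapshot IDs.
class LatestStatsLookupSketch {

  // Exact-match lookup: only the requested snapshot is considered,
  // so a snapshot without stats yields an empty result.
  public static OptionalLong exactMatch(Set<Long> snapshotsWithStats, long snapshotId) {
    return snapshotsWithStats.contains(snapshotId)
        ? OptionalLong.of(snapshotId)
        : OptionalLong.empty();
  }

  // Engine-side fallback: start at startId and follow parent links,
  // giving up after maxSteps hops so stale stats are never picked silently.
  public static OptionalLong nearestAncestorWithStats(
      Map<Long, Long> parentOf, Set<Long> snapshotsWithStats, long startId, int maxSteps) {
    Long current = startId;
    for (int steps = 0; current != null && steps <= maxSteps; steps++) {
      if (snapshotsWithStats.contains(current)) {
        return OptionalLong.of(current);
      }
      // A missing key or a null value both mean "no parent": the walk ends.
      current = parentOf.get(current);
    }
    return OptionalLong.empty();
  }

  public static void main(String[] args) {
    // Linear history 1 <- 2 <- 3 <- 4; only snapshots 1 and 2 have stats.
    Map<Long, Long> parentOf = new HashMap<>();
    parentOf.put(1L, null);
    parentOf.put(2L, 1L);
    parentOf.put(3L, 2L);
    parentOf.put(4L, 3L);
    Set<Long> withStats = Set.of(1L, 2L);

    System.out.println(exactMatch(withStats, 4L).isPresent());
    System.out.println(nearestAncestorWithStats(parentOf, withStats, 4L, 3).isPresent());
    System.out.println(nearestAncestorWithStats(parentOf, withStats, 4L, 1).isPresent());
  }
}
```

Keeping the step budget on the caller's side reflects the argument made in the thread: the engine, not the scan API, decides how stale is too stale.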

This PR adds a new API to scan partition statistics and provides an implementation for it. It deprecates the old way of querying partition stats but doesn't replace the usage of the deprecated functionality with the new API.

pvary · 2026-01-07T12:58:48Z

Merged to main.
Thanks @gaborkaszab for the PR and everyone for the reviews!

github-actions bot added API parquet core ORC labels Nov 20, 2025

gaborkaszab force-pushed the main_partition_stat_scan_api branch 3 times, most recently from 76e2686 to 3b4efed Compare November 20, 2025 10:59

gaborkaszab requested review from ajantha-bhat, nastra and pvary November 20, 2025 11:01

ajantha-bhat reviewed Nov 25, 2025

api/src/main/java/org/apache/iceberg/Table.java (resolved)

gaborkaszab requested a review from ajantha-bhat December 1, 2025 09:18

pvary approved these changes Dec 12, 2025

nastra requested a review from amogh-jahagirdar December 12, 2025 12:10

nastra reviewed Dec 12, 2025

core/src/test/java/org/apache/iceberg/PartitionStatisticsTestBase.java (outdated, resolved)

core/src/test/java/org/apache/iceberg/TestBasePartitionStatisticsScan.java (outdated, resolved)

findinpath reviewed Dec 15, 2025

API, Core: Scan API for partition stats

7733b0d

gaborkaszab force-pushed the main_partition_stat_scan_api branch from 3b4efed to 7733b0d Compare December 15, 2025 15:44

pvary merged commit 51d548a into apache:main Jan 7, 2026
45 checks passed

This was referenced Jan 7, 2026

Core: Use scan API to read partition stats #14989

Merged

Core, Data, Spark: Use partition stats scan API in tests #14996

Merged

PoC: API, Core, Spark: Scan API for partition stats #14508

Closed

pvary merged 1 commit into apache:main from gaborkaszab:main_partition_stat_scan_api
