
API, Core: Scan API for partition stats#14640

Merged
pvary merged 1 commit into apache:main from gaborkaszab:main_partition_stat_scan_api
Jan 7, 2026


@gaborkaszab
Collaborator

@gaborkaszab gaborkaszab commented Nov 20, 2025

Background: The current way to query partition stats is through PartitionStatsHandler.readPartitionStatsFile(). For this, the user has to put together the schema and obtain the input file to read. A more convenient API for scanning partition stats, similar to the other scan APIs, would improve usability. It could also offer filter and projection capabilities for better read performance.

Context: #14508 contains the changes required to introduce the new API covering the read functionalities existing today. To reduce the scope and the size of the review it's split into multiple steps and this one is the first with the following scope:

  • Introducing a new API to scan partition statistics
  • Providing an implementation for this new API
  • Deprecating the old way of querying partition stats
    However, the usage of the deprecated functionality is not replaced with the new API. That's a follow-up step.

For more details see #14508

@gaborkaszab
Collaborator Author

Note, this is split from this PR and concentrates on introducing, implementing and testing the new API. In #14508 you can see what it would look like to remove the usage of the deprecated functionality.

@gaborkaszab gaborkaszab force-pushed the main_partition_stat_scan_api branch 3 times, most recently from 76e2686 to 3b4efed Compare November 20, 2025 10:59
@gaborkaszab
Collaborator Author

gaborkaszab commented Nov 20, 2025

Hey @nastra ,
You were busy working on stats recently, do you think you can take a look at this? The original PR was #14508, which contains more details and also replaces the deprecated functionality with the new one. To keep the PR small, I split the introduction of the new functionality into this separate PR.

@gaborkaszab
Collaborator Author

Hey @nastra @ajantha-bhat @pvary ,
I'm pinging you because you were all involved with stats to some extent. Would you mind taking a look at this? Note, I have a PoC PR that contains the full picture, including replacing the old usage. This one is a split from that: it introduces the new API and its implementation, but doesn't replace usage of the old functionality.

@gaborkaszab
Collaborator Author

@nastra @ajantha-bhat Do you think you can take a look at this? You were involved with stats stuff recently.

Contributor

@pvary pvary left a comment


Looks good to me.
Let's wait a bit for others to chime in if they want. I will merge if there are no new comments.

@nastra
Contributor

nastra commented Dec 12, 2025

sorry I've been busy with internal things but I'll try to review this next week

.filter(f -> f.snapshotId() == snapshotId)
.findFirst();

if (statsFile.isEmpty()) {
Contributor

If the snapshot is missing partition stats, a previous snapshot may have them.

Any partition stats may be better than no partition stats.

Collaborator Author

Thanks for taking a look, @findinpath !
I think in general the "get me the latest available stats" use-case makes sense. However, it would introduce a lot of ambiguity into the API because we wouldn't know how fresh the returned stats are. I think we have two options here:

  1. Introduce a way to provide a snapshot ID (other than useSnapshot()) that is explicitly used for these latest-available searches. I think such functionality should somehow report how many steps were required to find the stats, and maybe some metrics about the snapshots in the chain that don't have partition stats. E.g. 3 steps were needed to find partition stats, and the skipped snapshots added 12 data files, etc. The engine can then judge whether using such stats makes sense or not.
  2. Let the engine do the work of finding the snapshot with partition stats. This way the engine can directly judge whether it makes sense to use such stats (e.g. it has already skipped half of the snapshots, so it won't continue following the chain). I think this is the cleaner approach at the moment. For this, PartitionStatsHandler.latestStatsFile() could help find the latest snapshot that has partition stats.
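To make the second option concrete, here is a minimal, self-contained sketch of how an engine might walk the snapshot chain itself and decide when to give up. It does not use any Iceberg classes: `Snap`, `StatsFile`, `Found`, `findLatestWithStats` and the `maxSkipped` cut-off are all hypothetical names invented for illustration, and the real lookup (for example via PartitionStatsHandler.latestStatsFile()) may behave differently.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class StatsLookupSketch {
  // Hypothetical stand-ins for table metadata; these are NOT Iceberg classes.
  record Snap(long id, Long parentId, int addedDataFiles) {}

  record StatsFile(long snapshotId, String path) {}

  // The stats file that was found, plus how stale it is relative to the starting snapshot.
  record Found(StatsFile statsFile, int skippedSnapshots, int skippedAddedDataFiles) {}

  // Walks from startId towards the root and returns the first snapshot that has a stats file.
  // Gives up once more than maxSkipped snapshots without stats have been passed.
  static Optional<Found> findLatestWithStats(
      Map<Long, Snap> snapshots, List<StatsFile> statsFiles, long startId, int maxSkipped) {
    int skipped = 0;
    int skippedFiles = 0;
    Snap current = snapshots.get(startId);
    while (current != null && skipped <= maxSkipped) {
      final long id = current.id();
      // Same exact-match lookup as in the reviewed code: stats registered for this snapshot ID.
      Optional<StatsFile> match =
          statsFiles.stream().filter(f -> f.snapshotId() == id).findFirst();
      if (match.isPresent()) {
        return Optional.of(new Found(match.get(), skipped, skippedFiles));
      }
      // No stats here: record what was skipped and move on to the parent snapshot.
      skipped++;
      skippedFiles += current.addedDataFiles();
      current = current.parentId() == null ? null : snapshots.get(current.parentId());
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Chain 1 <- 2 <- 3 <- 4; only snapshot 2 has partition stats.
    Map<Long, Snap> snapshots = Map.of(
        1L, new Snap(1L, null, 3),
        2L, new Snap(2L, 1L, 4),
        3L, new Snap(3L, 2L, 7),
        4L, new Snap(4L, 3L, 5));
    List<StatsFile> statsFiles = List.of(new StatsFile(2L, "stats-2.parquet"));

    System.out.println(findLatestWithStats(snapshots, statsFiles, 4L, 3));
    System.out.println(findLatestWithStats(snapshots, statsFiles, 4L, 1));
  }
}
```

The returned `Found` carries the number of skipped snapshots and the data files they added, which is the kind of freshness information the first option would have to expose through the API. Keeping the walk in the engine means the cut-off policy stays with the engine as well.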

Contributor

I’ll go with the second option for now. If we later identify a common or generic need to fetch statistics that are "around" a specific snapshot, we can introduce a new API method or add a utility function in the Iceberg core.

This PR adds a new API to scan partition statistics, and provides an
implementation for this new API. Deprecates the old way of querying
partition stats, however, doesn't replace the usage of the deprecated
functionality with the new API.
@gaborkaszab gaborkaszab force-pushed the main_partition_stat_scan_api branch from 3b4efed to 7733b0d Compare December 15, 2025 15:44
@pvary pvary merged commit 51d548a into apache:main Jan 7, 2026
45 checks passed
@pvary
Contributor

pvary commented Jan 7, 2026

Merged to main.
Thanks @gaborkaszab for the PR and everyone for the reviews!

5 participants