API, Core: Scan API for partition stats#14640
Conversation
76e2686 to
3b4efed
Compare
|
Hey @nastra , |
|
Hey @nastra @ajantha-bhat @pvary , |
|
@nastra @ajantha-bhat Do you think you can take a look at this? You were involved with stats stuff recently. |
pvary
left a comment
There was a problem hiding this comment.
Looks good to me.
Let's wait a bit for others to chime in if they want. I will merge if there are no new comments.
|
sorry I've been busy with internal things but I'll try to review this next week |
core/src/test/java/org/apache/iceberg/PartitionStatisticsTestBase.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestBasePartitionStatisticsScan.java
Outdated
Show resolved
Hide resolved
| .filter(f -> f.snapshotId() == snapshotId) | ||
| .findFirst(); | ||
|
|
||
| if (statsFile.isEmpty()) { |
There was a problem hiding this comment.
If the snapshot is missing partition stats, maybe a previous snapshot may have them.
Any partition stats may be better than no partition stats.
There was a problem hiding this comment.
Thanks for taking a look, @findinpath !
I think in general the "get me the latest available stats" use-case makes sense. However, it would introduce much ambiguity into the API because we won't know how fresh the returned stats are. I think we have 2 option here:
- introduce a way to provide a snapshot ID (other than
useSnapshot()) that will explicitly be used for these latest available searches. I think such a functionality should return somehow that how many steps were required to get the stats, and maybe some metrics about the stats themselves that doesn't have partition stats in the chain. E.g. 3 steps had to make to find partition stats and the skipped snapshots added 12 data data files etc. The engine can then judge whether using such stats makes sense or not. - Let the engine do the work to find the snapshot with partition stats. This way the engine directly can judge if it makes sense to use such stats. (e.g. already skipped half of the snapshots, won't continue following the chain). I think this is the cleaner approach at the moment. For this
PartitionStatsHandler.latestStatsFile()could help how to find the latest snapshot having partition stats.
There was a problem hiding this comment.
I’ll go with the second option for now. If we later identify a common or generic need to fetch statistics that are "around" a specific snapshot, we can introduce a new API method or add a utility function in the Iceberg core
This PR adds a new API to scan partition statistics, and provides an implementation for this new API. Deprecates the old way of querying partition stats, however, doesn't replace the usage of the deprecated functionality with the new API.
3b4efed to
7733b0d
Compare
|
Merged to main. |
Background: The current the way to query partition stats is through PartitionStatsHandler.readPartitionStatsFile(). For this the user has to put together the schema and get the input file to read. It would be beneficial for easier usability to have a more convenient API to scan partition stats similar to the other scan APIs. This could also have filter and projection capabilities for better read performance.
Context: #14508 contains the changes required to introduce the new API covering the read functionalities existing today. To reduce the scope and the size of the review it's split into multiple steps and this one is the first with the following scope:
However, the usage of the deprecated functionality is not replace with the new API. That's a follow-up step.
For more details see #14508