PoC: API, Core, Spark: Scan API for partition stats #14508

gaborkaszab · 2025-11-05T11:30:47Z

No description provided.

gaborkaszab · 2025-11-05T13:03:07Z

Some background: The current way to query partition stats is through PartitionStatsHandler.readPartitionStatsFile(). For that, the user has to put together the schema and get the input file to read. It would improve usability (a comment on my stats proposal doc also mentions this) to have a more convenient API for scanning partition stats. This could also have filter and projection capabilities.

The content of this PR:

Introduce PartitionStatisticsScan API and its implementation BasePartitionStatisticsScan in core. For simplicity this has the functionality that exists today, no filtering by partition, no projection.
Replace the usage of PartitionStatsHandler.readPartitionStatsFile() with the new API
Introduce the PartitionStatistics interface in the API module and make PartitionStats in core derive from it. This is needed so that the Scan API can use it as a return value, while the existing PartitionStats class stays in the core module.
Replace the usage of PartitionStats whenever possible with the new interface.

These could possibly be some follow-up steps:

Implementation of filter() and project() on the new Scan API
The naming of affected classes is a bit weird: interface api/PartitionStatistics that is implemented by core/PartitionStats. Ideally the name of the implementation would be BasePartitionStatistics. As a next step we can introduce a class with the same content and new name and deprecate the existing one, also remove usage. Changes within PartitionStats are easier to review in case "renaming" happens in a follow-up PR.
Older Spark versions should be covered
Producing Schema for projection as part of the api/PartitionStatistics interface (similarly to api/DataFile). Currently core/PartitionStatsHandler.schema() can produce V2/V3 schemas that could be given to the projection, but such functionality is better in API module, also some further flexibility might be required.
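
To make the intended shape of the Scan API easier to picture, here is a minimal, self-contained sketch of the refinement style described above: filter() and project() each return a new scan, and scan() produces the rows. Everything in it is a hypothetical stand-in (StatsRow, StatsScanSketch, a Predicate in place of Expression, column names in place of a Schema); it is not the Iceberg implementation, only an illustration of the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for one partition stats row; not the Iceberg type.
final class StatsRow {
  final String partition;
  final Long dataRecordCount; // null when the column is not projected
  final Long dataFileCount; // null when the column is not projected

  StatsRow(String partition, Long dataRecordCount, Long dataFileCount) {
    this.partition = partition;
    this.dataRecordCount = dataRecordCount;
    this.dataFileCount = dataFileCount;
  }
}

// Hypothetical scan: each refinement returns a new scan and leaves the original untouched.
final class StatsScanSketch {
  private final List<StatsRow> source;
  private final Predicate<StatsRow> filter;
  private final List<String> projectedColumns; // null means "all columns"

  private StatsScanSketch(
      List<StatsRow> source, Predicate<StatsRow> filter, List<String> projectedColumns) {
    this.source = source;
    this.filter = filter;
    this.projectedColumns = projectedColumns;
  }

  static StatsScanSketch of(List<StatsRow> rows) {
    return new StatsScanSketch(rows, row -> true, null);
  }

  StatsScanSketch filter(Predicate<StatsRow> newFilter) {
    return new StatsScanSketch(source, filter.and(newFilter), projectedColumns);
  }

  StatsScanSketch project(String... columns) {
    return new StatsScanSketch(source, filter, Arrays.asList(columns));
  }

  // Applies the filter on the full row first, then nulls out unprojected columns.
  List<StatsRow> scan() {
    List<StatsRow> result = new ArrayList<>();
    for (StatsRow row : source) {
      if (!filter.test(row)) {
        continue;
      }
      result.add(
          new StatsRow(
              row.partition,
              keep("data_record_count") ? row.dataRecordCount : null,
              keep("data_file_count") ? row.dataFileCount : null));
    }
    return result;
  }

  private boolean keep(String column) {
    return projectedColumns == null || projectedColumns.contains(column);
  }
}
```

In this sketch a caller would chain the refinements, for example StatsScanSketch.of(rows).filter(...).project("data_record_count").scan(), and the base scan stays reusable because no refinement mutates it.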

gaborkaszab · 2025-11-05T13:06:09Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

                value,
                (existingEntry, newEntry) -> {
-                  existingEntry.appendStats(newEntry);
+                  ((PartitionStats) existingEntry).appendStats(newEntry);


If the PartitionStatistics interface had the appendStats function, this cast (and another occurrence) wouldn't be needed. It seemed a bit weird to have it there, but I'm open to making this change to clean up the casts.

I would prefer to keep the interface clean
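
For readers following the trade-off, the sketch below shows the pattern being discussed: a read-only interface with no mutators, a mutable implementation that owns appendStats, and a Map.merge call that has to cast back to the implementation. The names ReadOnlyStats, MutableStats and MergeSketch are hypothetical and only mirror the roles of PartitionStatistics and PartitionStats; the code is an illustration, not the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical read-only view, playing the role of the API-module interface.
interface ReadOnlyStats {
  long dataRecordCount();
}

// Hypothetical mutable implementation, playing the role of the core class.
final class MutableStats implements ReadOnlyStats {
  private long dataRecordCount;

  MutableStats(long dataRecordCount) {
    this.dataRecordCount = dataRecordCount;
  }

  @Override
  public long dataRecordCount() {
    return dataRecordCount;
  }

  // The mutator lives only on the implementation, so the interface stays clean.
  void appendStats(ReadOnlyStats other) {
    this.dataRecordCount += other.dataRecordCount();
  }
}

final class MergeSketch {
  private MergeSketch() {}

  // Adds an entry for the partition, or folds it into the existing entry.
  static void mergeInto(
      Map<String, ReadOnlyStats> target, String partition, ReadOnlyStats incoming) {
    target.merge(
        partition,
        incoming,
        (existingEntry, newEntry) -> {
          // The cast is the price of keeping appendStats off the interface.
          ((MutableStats) existingEntry).appendStats(newEntry);
          return existingEntry;
        });
  }
}
```

The cast is safe here only because every value put into the map is a MutableStats; that invariant is local to the aggregation code, which is one argument for keeping the mutator out of the public interface.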

pvary · 2025-11-06T09:44:43Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+  PartitionStatisticsScan filter(Expression filter);
+
+  /**
+   * Create a new scan from this with the schema as its projection.


Maybe describe what will happen with the PartitionStatistics attributes which are not part of the schema.

Thanks for pointing this out! My initial plan was to always query the 'traditional' partition stats and allow projection for the column-based ones when we introduce them later. But it makes sense to also project the existing ones, and the current design with PartitionStatistics isn't suitable for that. Let me wrap my head around this and come back with a different design that can tackle this too.

pvary · 2025-11-06T09:47:25Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+  /**
+   * Create a new scan from this with the schema as its projection.
+   *
+   * @param schema a projection schema


How does the user create the Schema?

I would prefer something like the DataFile where the possible columns are available as constants, and the type is available as well. Maybe copy/move/deprecate the schema from the old place.

You're right. Let me add this to the possible follow-up steps in my comment at the top
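
A rough sketch of what "columns available as constants" could look like, in the spirit of DataFile. StatsField and PartitionStatsColumns are hypothetical names, the field IDs and names are illustrative and should be checked against the partition stats spec, and a plain list stands in for an Iceberg Schema.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical field descriptor; a real version would likely use the Iceberg field type.
final class StatsField {
  final int id;
  final String name;
  final boolean required;

  StatsField(int id, String name, boolean required) {
    this.id = id;
    this.name = name;
    this.required = required;
  }
}

// Hypothetical holder for the selectable columns. IDs and names are illustrative only.
final class PartitionStatsColumns {
  static final StatsField SPEC_ID = new StatsField(2, "spec_id", true);
  static final StatsField DATA_RECORD_COUNT = new StatsField(3, "data_record_count", true);
  static final StatsField DATA_FILE_COUNT = new StatsField(4, "data_file_count", true);
  static final StatsField LAST_UPDATED_AT = new StatsField(11, "last_updated_at", false);

  private PartitionStatsColumns() {}

  // Builds a projection from the constants, so callers never spell out IDs or names.
  static List<StatsField> projection(StatsField... fields) {
    return Collections.unmodifiableList(Arrays.asList(fields));
  }
}
```

With constants like these, a caller could write PartitionStatsColumns.projection(PartitionStatsColumns.SPEC_ID, PartitionStatsColumns.DATA_RECORD_COUNT) instead of assembling a schema by hand.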

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

pvary · 2025-11-06T10:00:31Z

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

+  }
+
+  @Override
+  public CloseableIterable<PartitionStatistics> scan() {


Do we have tests for this?

I haven't introduced dedicated tests for this because the capabilities are now the same as what we had before this patch, and TestPartitionStatsHandler covers it. Let's finalize the API part and I'll add a separate test suite too if you think it makes sense at this point.

I added a test suite for this too. Introduced a base class to share functionality with TestBasePartitionStatsHandler

gaborkaszab

Thanks for the review, @pvary !
To tackle future projection I changed the primitive members of PartitionStatistics to objects so that we can leave them null if they are not queried.
I created a new BasePartitionStatistics class to implement the above change. It is mainly a copy-paste from PartitionStats (which I made deprecated) with the following changes:

member stats are objects and not primitives
In the constructor for the write path I initialize the necessary stats to zero to avoid writing nulls for required fields. Otherwise we'd get NPE from the writers
The class inherits from SupportsIndexProjection instead of implementing StructLike directly, hence it implements internalGet and internalSet.
There is a new constructor for the read path that accepts a projection Schema. It doesn't yet call the super() constructor that is needed for projections rather than a full read; I left a TODO comment in the code for this.

Once the API is finalized, I plan to add a test suite for the new scan API, and also one for the new class BasePartitionStatistics.
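
The changes listed above can be pictured with a small stand-alone sketch: boxed members that stay null on the read path unless they are set, a write-path factory that zeroes the required counters, and positional get/set methods in the spirit of internalGet and internalSet. BoxedStatsSketch, its factories and its field positions are hypothetical and much smaller than the real class.

```java
// Hypothetical, reduced model of a stats row with boxed members; not the class in this PR.
final class BoxedStatsSketch {
  private Integer specId;
  private Long dataRecordCount;
  private Long dataFileCount;
  private Long totalRecordCount; // optional field, may stay null even on the write path

  private BoxedStatsSketch() {}

  // Write path: required counters start at zero so a writer never sees null for them.
  static BoxedStatsSketch forWrite(int specId) {
    BoxedStatsSketch stats = new BoxedStatsSketch();
    stats.specId = specId;
    stats.dataRecordCount = 0L;
    stats.dataFileCount = 0L;
    return stats;
  }

  // Read path: everything starts as null and only projected positions get set.
  static BoxedStatsSketch forRead() {
    return new BoxedStatsSketch();
  }

  // Positional read access.
  Object get(int pos) {
    switch (pos) {
      case 0:
        return specId;
      case 1:
        return dataRecordCount;
      case 2:
        return dataFileCount;
      case 3:
        return totalRecordCount;
      default:
        throw new UnsupportedOperationException("Unknown position: " + pos);
    }
  }

  // Positional write access, used by a reader to fill in the values it actually read.
  void set(int pos, Object value) {
    switch (pos) {
      case 0:
        this.specId = (Integer) value;
        break;
      case 1:
        this.dataRecordCount = (Long) value;
        break;
      case 2:
        this.dataFileCount = (Long) value;
        break;
      case 3:
        this.totalRecordCount = (Long) value;
        break;
      default:
        throw new UnsupportedOperationException("Unknown position: " + pos);
    }
  }
}
```

The point of the boxed types is that null can mean "not projected", which a primitive long cannot express without a sentinel value.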

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

gaborkaszab · 2025-11-11T08:21:46Z

core/src/main/java/org/apache/iceberg/BasePartitionStatisticsScan.java

+  }
+
+  @Override
+  public CloseableIterable<PartitionStatistics> scan() {


I haven't introduced dedicated tests for this because the capabilities are now the same as what we had before this patch, and TestPartitionStatsHandler covers it. Let's finalize the API part and I'll add a separate test suite too if you think it makes sense at this point.

gaborkaszab · 2025-11-11T08:23:54Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+  PartitionStatisticsScan filter(Expression filter);
+
+  /**
+   * Create a new scan from this with the schema as its projection.


Thanks for pointing this out! My initial plan was to always query the 'traditional' partition stats and allow projection for the column-based ones when we introduce them later. But it makes sense to also project the existing ones, and the current design with PartitionStatistics isn't suitable for that. Let me wrap my head around this and come back with a different design that can tackle this too.

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

pvary · 2025-11-13T11:38:43Z

core/src/main/java/org/apache/iceberg/BasePartitionStatistics.java

+    this.dataRecordCount += entry.dataRecordCount();
+    this.dataFileCount += entry.dataFileCount();
+    this.totalDataFileSizeInBytes += entry.totalDataFileSizeInBytes();
+    this.positionDeleteRecordCount += entry.positionDeleteRecordCount();
+    this.positionDeleteFileCount += entry.positionDeleteFileCount();
+    this.equalityDeleteRecordCount += entry.equalityDeleteRecordCount();
+    this.equalityDeleteFileCount += entry.equalityDeleteFileCount();


What happens when one of these is null?

They can't be null, because this is on the write path where we use the full V2 or V3 schema for read/write. Added a comment

Maybe do a precondition check on them?
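
As an illustration of the suggested check, the sketch below validates the boxed counters before adding them. Without the check, `+=` on a null Long would still fail with a NullPointerException from unboxing, but without a message that names the offending field. AppendSketch is a hypothetical reduced class, and java.util.Objects.requireNonNull is used here only as a stand-in for whichever precondition helper the codebase prefers.

```java
import java.util.Objects;

// Hypothetical reduced stats holder with boxed counters; not the class in this PR.
final class AppendSketch {
  private Long dataRecordCount;
  private Long dataFileCount;

  AppendSketch(Long dataRecordCount, Long dataFileCount) {
    this.dataRecordCount = dataRecordCount;
    this.dataFileCount = dataFileCount;
  }

  Long dataRecordCount() {
    return dataRecordCount;
  }

  Long dataFileCount() {
    return dataFileCount;
  }

  // Write-path aggregation: both sides must carry every counter.
  void appendStats(AppendSketch entry) {
    // Validate everything first so a failure never leaves this object half-updated.
    Objects.requireNonNull(this.dataRecordCount, "Cannot append: dataRecordCount is null");
    Objects.requireNonNull(this.dataFileCount, "Cannot append: dataFileCount is null");
    Objects.requireNonNull(entry.dataRecordCount, "Invalid entry: dataRecordCount is null");
    Objects.requireNonNull(entry.dataFileCount, "Invalid entry: dataFileCount is null");

    this.dataRecordCount += entry.dataRecordCount;
    this.dataFileCount += entry.dataFileCount;
  }
}
```

Checking all fields up front, rather than one by one between the additions, keeps the target object unchanged when the entry turns out to be incomplete.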

pvary · 2025-11-13T11:39:44Z

core/src/main/java/org/apache/iceberg/BasePartitionStatistics.java

+  private Long totalRecordCount; // null by default
+  private Long lastUpdatedAt; // null by default
+  private Long lastUpdatedSnapshotId; // null by default


It is hard to understand the comment.

Are the others not null by default?

You're right, these comments are misleading now. Originally, for the write path these were true, but now on the read path any of the stats can be null. Removed the comments.

pvary · 2025-11-13T11:46:10Z

core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java

+   * @deprecated will be removed in 1.12.0, use {@link PartitionStatisticsScan} instead
   */
+  @Deprecated
  public static CloseableIterable<PartitionStats> readPartitionStatsFile(


Do we still have tests that execute the old code path?

I replaced all the calls with the new Scan API, including in the tests. So no, the tests now exercise the new way. Since this is deprecated now, I think that's fine. LMK if I missed something.

FYI, I added the relevant tests back that exercise this deprecated function.

Marked the tests exercising the old functionality as deprecated simply for visibility.

pvary · 2025-11-19T09:31:27Z

api/src/main/java/org/apache/iceberg/PartitionStatisticsScan.java

+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.io.CloseableIterable;
+
+public interface PartitionStatisticsScan {


nit: maybe at least a one-line Javadoc

github-actions · 2026-01-01T00:22:12Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

gaborkaszab · 2026-01-07T10:00:53Z

Let's keep this open until all the relevant PRs are merged

gaborkaszab · 2026-01-13T12:24:23Z

This is split now into 3 parts (PR PR PR) and all of them are either merged or published, so this can be closed now.

github-actions bot added API spark core data labels Nov 5, 2025

gaborkaszab commented Nov 5, 2025

gaborkaszab requested review from ajantha-bhat, nastra, pvary and rdblue November 5, 2025 13:06

pvary reviewed Nov 6, 2025

gaborkaszab force-pushed the main_partition_stat_scan branch from fb06b8a to f67c319 Compare November 13, 2025 07:54

gaborkaszab commented Nov 13, 2025

gaborkaszab requested a review from pvary November 13, 2025 08:05

gaborkaszab force-pushed the main_partition_stat_scan branch from f67c319 to feccefc Compare November 13, 2025 09:32

pvary reviewed Nov 13, 2025

gaborkaszab force-pushed the main_partition_stat_scan branch from feccefc to 9df7bd6 Compare November 17, 2025 16:07

github-actions bot added parquet ORC labels Nov 17, 2025

gaborkaszab force-pushed the main_partition_stat_scan branch 2 times, most recently from 2ec029b to 706a56f Compare November 17, 2025 20:33

gaborkaszab requested a review from pvary November 17, 2025 21:02

gaborkaszab force-pushed the main_partition_stat_scan branch from 706a56f to d97a375 Compare November 18, 2025 10:19

gaborkaszab force-pushed the main_partition_stat_scan branch 2 times, most recently from 17dc544 to a7f4f6b Compare November 18, 2025 11:02

pvary reviewed Nov 19, 2025

gaborkaszab mentioned this pull request Nov 19, 2025

Core, Parquet: Add filter to InternalData #14630

Closed

API, Core, Spark: Scan API for partition stats

dc4f6a4

gaborkaszab force-pushed the main_partition_stat_scan branch from a7f4f6b to dc4f6a4 Compare November 19, 2025 16:01

gaborkaszab mentioned this pull request Nov 20, 2025

API, Core: Scan API for partition stats #14640

Merged

gaborkaszab changed the title ~~API, Core, Spark: Scan API for partition stats~~ PoC: API, Core, Spark: Scan API for partition stats Dec 1, 2025

github-actions bot added the stale label Jan 1, 2026

gaborkaszab removed the stale label Jan 7, 2026

gaborkaszab closed this Jan 13, 2026
