-
Notifications
You must be signed in to change notification settings - Fork 3k
PoC: API, Core, Spark: Scan API for partition stats #14508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Some background: The current the way to query partition stats is through The content of this PR:
These could possibly be some follow-up steps:
|
| value, | ||
| (existingEntry, newEntry) -> { | ||
| existingEntry.appendStats(newEntry); | ||
| ((PartitionStats) existingEntry).appendStats(newEntry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If PartitionStatistics interface had the appendStats function, this cast (and another occurrence) wouldn't be needed. It seemed a bit weird to have it there, but I'm open to make this change to clean up casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to keep the interface clean
| PartitionStatisticsScan filter(Expression filter); | ||
|
|
||
| /** | ||
| * Create a new scan from this with the schema as its projection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe describe what will happen with the PartitionStatistics attributes which are not part of the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for pointing this out! My initial plan was to always query the 'traditional' partition stats and allow projection for the column-based once when we introduce them later. But it makes sense to project also the existing ones, and then the current design with PartitionStatistics isn't suitable for that. Let me wrap my head around this and come back with a different design that can tackle this too.
| /** | ||
| * Create a new scan from this with the schema as its projection. | ||
| * | ||
| * @param schema a projection schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the user create the Schema?
I would prefer something like the DataFile where the possible columns are available as constants, and the type is available as well. Maybe copy/move/deprecate the schema from the old place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Let me add this to the possible follow-up steps in my comment at the top
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| @Override | ||
| public CloseableIterable<PartitionStatistics> scan() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have tests for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't introduced dedicated tests for this because now the capabilities is the same as we had before this patch, and TestPartitionStatsHandler covers it. Let's finalize the API part and I'll add a separate test suite too if you think it makes sense at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test suite for this too. Introduced a base class to share functionality with TestBasePartitionStatsHandler
fb06b8a to
f67c319
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @pvary !
To tackle future projection I changed the primitive members of PartitionStatistics to objects so that we can leave them null if not queries.
I created a new BasePartitionStatistics class to implement the above change. It is mainly a copy-paste form PartitionStats (that I made deprecated) with the following changes:
- member stats are objects and not primitives
- In the constructor for the write path I initialize the necessary stats to zero to avoid writing nulls for required fields. Otherwise we'd get NPE from the writers
- The class inherits from SupportsIndexProjection instead of StructType, hence implements
internalGetandinternalSet. - There is a new constructor for the read path that accepts a projection Schema. It doesn't call the particular super() constructor that is needed for projections instead of full read, left a TODO comment in the code for this.
Once finalizing the API, I plan to add a test suite for the new scan API, and also one for the new class BasePartitionStatistics.
core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| @Override | ||
| public CloseableIterable<PartitionStatistics> scan() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't introduced dedicated tests for this because now the capabilities is the same as we had before this patch, and TestPartitionStatsHandler covers it. Let's finalize the API part and I'll add a separate test suite too if you think it makes sense at this point.
| PartitionStatisticsScan filter(Expression filter); | ||
|
|
||
| /** | ||
| * Create a new scan from this with the schema as its projection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for pointing this out! My initial plan was to always query the 'traditional' partition stats and allow projection for the column-based once when we introduce them later. But it makes sense to project also the existing ones, and then the current design with PartitionStatistics isn't suitable for that. Let me wrap my head around this and come back with a different design that can tackle this too.
f67c319 to
feccefc
Compare
| this.dataRecordCount += entry.dataRecordCount(); | ||
| this.dataFileCount += entry.dataFileCount(); | ||
| this.totalDataFileSizeInBytes += entry.totalDataFileSizeInBytes(); | ||
| this.positionDeleteRecordCount += entry.positionDeleteRecordCount(); | ||
| this.positionDeleteFileCount += entry.positionDeleteFileCount(); | ||
| this.equalityDeleteRecordCount += entry.equalityDeleteRecordCount(); | ||
| this.equalityDeleteFileCount += entry.equalityDeleteFileCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when one of these are null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can't be null, because this is on the write path where we use the full V2 or V3 schema for read/write. Added a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do a precondition check on them?
| private Long totalRecordCount; // null by default | ||
| private Long lastUpdatedAt; // null by default | ||
| private Long lastUpdatedSnapshotId; // null by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is hard to understand the comment.
Are the others not null by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, these comments are misleading now. Originally, for the write path these were true, but now on the read path any of the stats can be null. Removed the comments.
| * @deprecated will be removed in 1.12.0, use {@link PartitionStatisticsScan} instead | ||
| */ | ||
| @Deprecated | ||
| public static CloseableIterable<PartitionStats> readPartitionStatsFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still have tests which executing the old code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced all the calls with the new Scan API, including the tests. So no, now the tests exercise the new way. Since this is deprecated now, I found it's fine. LMK if I missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I added the relevant tests back that exercise this deprecated function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked the tests exercising the old functionality as deprecated simply for visibility.
feccefc to
9df7bd6
Compare
2ec029b to
706a56f
Compare
706a56f to
d97a375
Compare
17dc544 to
a7f4f6b
Compare
| import org.apache.iceberg.expressions.Expression; | ||
| import org.apache.iceberg.io.CloseableIterable; | ||
|
|
||
| public interface PartitionStatisticsScan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe at lease a oneliner javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
a7f4f6b to
dc4f6a4
Compare
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
Let's keep this open until all the relevant PRs are merged |
No description provided.