Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unneeded metadata read during update event generation #11829

Conversation

grantatspothero
Copy link
Contributor

Followup from this PR:
#10523

The above PR removed unnecessary objectstore reads after commit, but there was 1 I missed. SnapshotProducer.notifyListeners has the same problem of reading metadata from objectstore instead of just reading the in memory committed Snapshot object.

@github-actions github-actions bot added the core label Dec 19, 2024
@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch 2 times, most recently from ad81312 to 28984fe Compare December 19, 2024 21:28
@@ -475,10 +475,14 @@ public void commit() {
}
}

Object updateEvent(Snapshot committedSnapshot) {
Copy link
Contributor Author

@grantatspothero grantatspothero Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PendingUpdate.updateEvent only usage is in SnapshotProducer currently so could change the interface directly to updateEvent(Snapshot committedSnapshot), but did not want to make a backwards incompatible API change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SnapshotProducer is package private, so I think we're OK in terms of backwards compatibility since it's not like a public API is being broken.

Copy link
Contributor Author

@grantatspothero grantatspothero Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was trying to provide context why I didn't modify PendingUpdate.updateEvent directly (it is a public API)

But agree with you is safe to modify SnapshotProducer here.

@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch 2 times, most recently from a710e1c to 072bacf Compare December 20, 2024 18:43
@amogh-jahagirdar amogh-jahagirdar self-requested a review January 1, 2025 01:48
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @grantatspothero , I agree with the principle idea of this change to derive the update event from the committed snapshot, no need to potentially read the whole metadata again right after the commit. I just had some comments on the implementation ; it'd also be ideal to have some tests which verify the produced event has the expected properties.

There's a broad question for the need for the listener API since I think these days at least for the commit path, the commit report sent to REST implementations has all those details but there's probably legitimate use cases for non-REST cases or even just generic patterns (sending events to a queue or whatnot). The interface is pretty straightforward/lightweight, and users can have whatever complexity they want in their own implementations.

Comment on lines 163 to 162
ValidationException.check(
snapshotId == committedSnapshot.snapshotId(),
"Committed snapshotId %s does not match expected snapshotId %s",
committedSnapshot.snapshotId(),
snapshotId);
long sequenceNumber = committedSnapshot.sequenceNumber();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we uplevel this logic, it's common to all the implementations?

@@ -475,10 +475,14 @@ public void commit() {
}
}

Object updateEvent(Snapshot committedSnapshot) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SnapshotProducer is package private, so I think we're OK in terms of backwards compatibility since it's not like a public API is being broken.

@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch from 072bacf to 9ca74ef Compare January 3, 2025 22:14
@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch 4 times, most recently from 7a2611c to 54ba801 Compare January 6, 2025 19:48
@@ -114,7 +114,10 @@ public void testMetadataFileLocationsWithMissingFiles() {
// delete v3.metadata.json making v2.metadata.json and v1.metadata.json inaccessible
table.io().deleteFile(location);

Set<String> metadataFileLocations = ReachableFileUtil.metadataFileLocations(table, true);
// original hadoop table will not see the file deletion
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is side effect of removing refresh() calls during update event generation. Code which relied upon table.refresh being called for correctness will now break.

This is mostly a problem for hadoop/filesystem tables and these are deprecated: https://iceberg.apache.org/spec/#file-system-tables

For metastore catalogs for example, table.refresh() can be manually called after commits. This does not work for filesystem tables hence the creating of a new table.

@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch from 54ba801 to b12611c Compare January 6, 2025 22:00
avoids unnecessary iceberg metadata read of just committed snapshot
@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch from b12611c to 3887b94 Compare January 6, 2025 22:01
@@ -255,8 +257,13 @@ void failCommits(int numFailures) {
this.failCommits = numFailures;
}

int getMetadataFetchCount() {
Copy link
Contributor Author

@grantatspothero grantatspothero Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to count calls to current()/refresh() as a proxy for reading metadata from objectstorage. For TestTables (which uses in memory metadata) that is the best we can do.

Is there some other test I could write that uses real file system metadata (not TestTables)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why current/refresh counting is not great: cached metadata fetches are free while uncached fetches are not, and this does not differentiate between the two.

Instrumenting at the fileio level would let us validate indeed only a single uncached metadata request is needed for the whole commit.

@grantatspothero grantatspothero force-pushed the gn/removeUnneededMetadataReadUpdateEvent branch from f0e46c9 to 3887b94 Compare January 13, 2025 21:37
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 13, 2025
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants