-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove unneeded metadata read during update event generation #11829
Remove unneeded metadata read during update event generation #11829
Conversation
ad81312
to
28984fe
Compare
@@ -475,10 +475,14 @@ public void commit() { | |||
} | |||
} | |||
|
|||
Object updateEvent(Snapshot committedSnapshot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PendingUpdate.updateEvent
only usage is in SnapshotProducer currently so could change the interface directly to updateEvent(Snapshot committedSnapshot)
, but did not want to make a backwards incompatible API change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SnapshotProducer
is package private, so I think we're OK in terms of backwards compatibility since it's not like a public API is being broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was trying to provide context why I didn't modify PendingUpdate.updateEvent
directly (it is a public API)
But agree with you is safe to modify SnapshotProducer here.
a710e1c
to
072bacf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @grantatspothero , I agree with the principle idea of this change to derive the update event from the committed snapshot, no need to potentially read the whole metadata again right after the commit. I just had some comments on the implementation ; it'd also be ideal to have some tests which verify the produced event has the expected properties.
There's a broad question for the need for the listener API since I think these days at least for the commit path, the commit report sent to REST implementations has all those details but there's probably legitimate use cases for non-REST cases or even just generic patterns (sending events to a queue or whatnot). The interface is pretty straightforward/lightweight, and users can have whatever complexity they want in their own implementations.
ValidationException.check( | ||
snapshotId == committedSnapshot.snapshotId(), | ||
"Committed snapshotId %s does not match expected snapshotId %s", | ||
committedSnapshot.snapshotId(), | ||
snapshotId); | ||
long sequenceNumber = committedSnapshot.sequenceNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we uplevel this logic, it's common to all the implementations?
@@ -475,10 +475,14 @@ public void commit() { | |||
} | |||
} | |||
|
|||
Object updateEvent(Snapshot committedSnapshot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SnapshotProducer
is package private, so I think we're OK in terms of backwards compatibility since it's not like a public API is being broken.
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
Outdated
Show resolved
Hide resolved
072bacf
to
9ca74ef
Compare
7a2611c
to
54ba801
Compare
@@ -114,7 +114,10 @@ public void testMetadataFileLocationsWithMissingFiles() { | |||
// delete v3.metadata.json making v2.metadata.json and v1.metadata.json inaccessible | |||
table.io().deleteFile(location); | |||
|
|||
Set<String> metadataFileLocations = ReachableFileUtil.metadataFileLocations(table, true); | |||
// original hadoop table will not see the file deletion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is side effect of removing refresh()
calls during update event generation. Code which relied upon table.refresh being called for correctness will now break.
This is mostly a problem for hadoop/filesystem tables and these are deprecated: https://iceberg.apache.org/spec/#file-system-tables
For metastore catalogs for example, table.refresh()
can be manually called after commits. This does not work for filesystem tables hence the creating of a new table.
54ba801
to
b12611c
Compare
avoids unnecessary iceberg metadata read of just committed snapshot
b12611c
to
3887b94
Compare
@@ -255,8 +257,13 @@ void failCommits(int numFailures) { | |||
this.failCommits = numFailures; | |||
} | |||
|
|||
int getMetadataFetchCount() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to count calls to current()/refresh() as a proxy for reading metadata from objectstorage. For TestTables (which uses in memory metadata) that is the best we can do.
Is there some other test I could write that uses real file system metadata (not TestTables)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why current/refresh counting is not great: cached metadata fetches are free while uncached fetches are not, and this does not differentiate between the two.
Instrumenting at the fileio level would let us validate indeed only a single uncached metadata request is needed for the whole commit.
f0e46c9
to
3887b94
Compare
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Followup from this PR:
#10523
The above PR removed unnecessary objectstore reads after commit, but there was 1 I missed.
SnapshotProducer.notifyListeners
has the same problem of reading metadata from objectstore instead of just reading the in memory committed Snapshot object.