-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement update for remove-snapshots
action
#1561
Conversation
remove-snapshot
actionremove-snapshots
action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This was one of the missing update functions mentioned in #952
Do you mind also including some tests? Similar to
iceberg-python/tests/table/test_init.py
Lines 663 to 682 in 2cd4e78
def test_apply_remove_properties_update(table_v2: Table) -> None: | |
base_metadata = update_table_metadata( | |
table_v2.metadata, | |
(SetPropertiesUpdate(updates={"test_a": "test_a", "test_b": "test_b", "test_c": "test_c", "test_d": "test_d"}),), | |
) | |
new_metadata_no_removal = update_table_metadata(base_metadata, (RemovePropertiesUpdate(removals=[]),)) | |
assert base_metadata == new_metadata_no_removal | |
new_metadata = update_table_metadata(base_metadata, (RemovePropertiesUpdate(removals=["test_a", "test_c"]),)) | |
assert base_metadata.properties == { | |
"read.split.target.size": "134217728", | |
"test_a": "test_a", | |
"test_b": "test_b", | |
"test_c": "test_c", | |
"test_d": "test_d", | |
} | |
assert new_metadata.properties == {"read.split.target.size": "134217728", "test_b": "test_b", "test_d": "test_d"} |
Sure! Thanks for the fast review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments, looks like theres also a linter error, do you mind running make lint
locally?
@@ -455,6 +455,19 @@ def _(update: SetSnapshotRefUpdate, base_metadata: TableMetadata, context: _Tabl | |||
return base_metadata.model_copy(update=metadata_updates) | |||
|
|||
|
|||
@_apply_table_update.register(RemoveSnapshotsUpdate) | |||
def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyiceberg/table/update/__init__.py
Outdated
def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
for remove_snapshot_id in update.snapshot_ids: | ||
if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we block the current snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert in iceberg spec, but it's not clear what should happen if you try to remove the current snapshot.
I'm also not sure if I should update parent_snapshot_id in every snapshot that was referencing removed snapshots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to set parent_snapshot_id to None if the parent is gone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear what should happen if you try to remove the current snapshot.
im looking at the java implementation for answers, i think you can just remove the current snapshot... because you can have an empty table with no snapshots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a separate pr for remove-snapshot-ref
and added a unit test there #1598
Hey @kevinjqliu, ready for another review round. I had to cherry pick the changes from #822 to reuse the code that removes refs |
Fixed the linters |
Removed corresponding statistics files and entries from snapshot log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a few comments to the PR. I think we might need some integrations tests to make sure the behavior aligns with the java implementation.
Also, looks like remove_tag
and remove_branch
are unrelated to this PR, perhaps we can move them to a separate PR.
pyiceberg/table/update/__init__.py
Outdated
def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
for remove_snapshot_id in update.snapshot_ids: | ||
if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear what should happen if you try to remove the current snapshot.
im looking at the java implementation for answers, i think you can just remove the current snapshot... because you can have an empty table with no snapshots
tests/table/test_init.py
Outdated
assert len(table_v2.metadata.snapshots) == 2 | ||
assert len(table_v2.metadata.snapshot_log) == 2 | ||
assert len(table_v2.metadata.refs) == 2 | ||
update = RemoveSnapshotsUpdate(snapshot_ids=[3051729675574597004]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit can you make 3051729675574597004
a constant for readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added constants REMOVE_SNAPSHOT and KEEP_SNAPSHOT
tests/table/test_init.py
Outdated
with pytest.raises(ValueError, match="Can't remove current snapshot id 3055729675574597004"): | ||
update_table_metadata(table_v2.metadata, (update,)) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets also add some tests for RemoveSnapshotRefUpdate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
28c6657
to
19e17f0
Compare
c8c63ea
to
32e1e85
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for review! Could you explain which kind of integration tests you want? Like pyspark integration with expire_snapshots
call?
pyiceberg/table/update/__init__.py
Outdated
def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata: | ||
for remove_snapshot_id in update.snapshot_ids: | ||
if remove_snapshot_id == base_metadata.current_snapshot_id: | ||
raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a separate pr for remove-snapshot-ref
and added a unit test there #1598
tests/table/test_init.py
Outdated
with pytest.raises(ValueError, match="Can't remove current snapshot id 3055729675574597004"): | ||
update_table_metadata(table_v2.metadata, (update,)) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a separate pr for remove-snapshot-ref and added a unit test there #1598
tests/table/test_init.py
Outdated
assert len(table_v2.metadata.snapshots) == 2 | ||
assert len(table_v2.metadata.snapshot_log) == 2 | ||
assert len(table_v2.metadata.refs) == 2 | ||
update = RemoveSnapshotsUpdate(snapshot_ids=[3051729675574597004]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added constants REMOVE_SNAPSHOT and KEEP_SNAPSHOT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some small comments, but this looks great @grihabor
pyiceberg/table/update/__init__.py
Outdated
|
||
snapshots = [ | ||
(s.model_copy(update={"parent_snapshot_id": None}) if s.parent_snapshot_id in update.snapshot_ids else s) | ||
for s in base_metadata.snapshots |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a little more verbose:
for s in base_metadata.snapshots | |
for snapshot in base_metadata.snapshots |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure fb8f350
@kevinjqliu Sure, merged wth upstream main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
No description provided.