Implement update for remove-snapshots action #1561

Merged
merged 15 commits into apache:main on Feb 17, 2025

Conversation

grihabor
Contributor

No description provided.

@grihabor grihabor changed the title Implement update for remove-snapshot action Implement update for remove-snapshots action Jan 22, 2025
Contributor

@kevinjqliu kevinjqliu left a comment

Thanks! This was one of the missing update functions mentioned in #952

Do you mind also including some tests? Similar to

def test_apply_remove_properties_update(table_v2: Table) -> None:
    base_metadata = update_table_metadata(
        table_v2.metadata,
        (SetPropertiesUpdate(updates={"test_a": "test_a", "test_b": "test_b", "test_c": "test_c", "test_d": "test_d"}),),
    )
    new_metadata_no_removal = update_table_metadata(base_metadata, (RemovePropertiesUpdate(removals=[]),))
    assert base_metadata == new_metadata_no_removal
    new_metadata = update_table_metadata(base_metadata, (RemovePropertiesUpdate(removals=["test_a", "test_c"]),))
    assert base_metadata.properties == {
        "read.split.target.size": "134217728",
        "test_a": "test_a",
        "test_b": "test_b",
        "test_c": "test_c",
        "test_d": "test_d",
    }
    assert new_metadata.properties == {"read.split.target.size": "134217728", "test_b": "test_b", "test_d": "test_d"}

@grihabor
Contributor Author

Sure! Thanks for the fast review

Contributor

@kevinjqliu kevinjqliu left a comment

Left some comments. It looks like there's also a linter error; do you mind running make lint locally?

@@ -455,6 +455,19 @@ def _(update: SetSnapshotRefUpdate, base_metadata: TableMetadata, context: _Tabl
    return base_metadata.model_copy(update=metadata_updates)


@_apply_table_update.register(RemoveSnapshotsUpdate)
def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:

def _(update: RemoveSnapshotsUpdate, base_metadata: TableMetadata, context: _TableMetadataUpdateContext) -> TableMetadata:
    for remove_snapshot_id in update.snapshot_ids:
        if remove_snapshot_id == base_metadata.current_snapshot_id:
            raise ValueError(f"Can't remove current snapshot id {remove_snapshot_id}")
Contributor

Should we block removing the current snapshot?

Contributor Author

I'm not an expert in the Iceberg spec, but it's not clear to me what should happen if you try to remove the current snapshot.

I'm also not sure whether I should update parent_snapshot_id in every snapshot that referenced a removed snapshot.

Contributor Author

Decided to set parent_snapshot_id to None if the parent is gone
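
For readers following along, here is a minimal sketch of that decision: drop the requested snapshots and clear parent_snapshot_id on any survivor whose parent was removed. It uses a plain dataclass as a stand-in, so `Snap` and `drop_snapshots` are hypothetical names for illustration and not pyiceberg's models or the code in this PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Snap:
    """Hypothetical stand-in for a snapshot; not pyiceberg's Snapshot model."""

    snapshot_id: int
    parent_snapshot_id: Optional[int] = None


def drop_snapshots(snapshots: list[Snap], remove_ids: set[int]) -> list[Snap]:
    """Drop the requested snapshots and detach children whose parent is gone."""
    return [
        # A surviving snapshot whose parent was removed gets parent_snapshot_id=None.
        replace(snapshot, parent_snapshot_id=None)
        if snapshot.parent_snapshot_id in remove_ids
        else snapshot
        for snapshot in snapshots
        if snapshot.snapshot_id not in remove_ids
    ]


history = [Snap(1), Snap(2, parent_snapshot_id=1), Snap(3, parent_snapshot_id=2)]
remaining = drop_snapshots(history, {1})
```

In this sketch, snapshot 2 loses its parent pointer while snapshot 3 still points at snapshot 2, so the rest of the lineage stays intact.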

Contributor

it's not clear what should happen if you try to remove the current snapshot.

I'm looking at the Java implementation for answers. I think you can just remove the current snapshot, because you can have an empty table with no snapshots.

Contributor Author

Created a separate PR for remove-snapshot-ref and added a unit test there: #1598

@grihabor
Contributor Author

Hey @kevinjqliu, this is ready for another review round. I had to cherry-pick the changes from #822 to reuse the code that removes refs.

@grihabor
Contributor Author

Fixed the linters

@grihabor
Contributor Author

Removed corresponding statistics files and entries from snapshot log
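
A rough sketch of what that cleanup amounts to: any snapshot-log entry or statistics-file record that points at a removed snapshot is filtered out. The `LogEntry` and `StatsFile` classes, the `prune` helper, and the file paths are all hypothetical stand-ins for illustration, not pyiceberg's types or the code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical stand-in for a snapshot-log entry."""

    snapshot_id: int
    timestamp_ms: int


@dataclass(frozen=True)
class StatsFile:
    """Hypothetical stand-in for a statistics-file record."""

    snapshot_id: int
    path: str


def prune(records, remove_ids):
    """Keep only the records whose snapshot_id is not being removed."""
    return [record for record in records if record.snapshot_id not in remove_ids]


remove_ids = {10}
snapshot_log = [LogEntry(10, 1_000), LogEntry(20, 2_000)]
statistics = [StatsFile(10, "stats-10.puffin"), StatsFile(20, "stats-20.puffin")]

kept_log = prune(snapshot_log, remove_ids)
kept_stats = prune(statistics, remove_ids)
```

The same filter works for both collections because each record carries the ID of the snapshot it belongs to.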

Contributor

@kevinjqliu kevinjqliu left a comment

I added a few comments to the PR. I think we might need some integration tests to make sure the behavior aligns with the Java implementation.

Also, it looks like remove_tag and remove_branch are unrelated to this PR; perhaps we can move them to a separate PR.

assert len(table_v2.metadata.snapshots) == 2
assert len(table_v2.metadata.snapshot_log) == 2
assert len(table_v2.metadata.refs) == 2
update = RemoveSnapshotsUpdate(snapshot_ids=[3051729675574597004])
Contributor

nit: can you make 3051729675574597004 a constant for readability?

Contributor Author

Added constants REMOVE_SNAPSHOT and KEEP_SNAPSHOT

with pytest.raises(ValueError, match="Can't remove current snapshot id 3055729675574597004"):
    update_table_metadata(table_v2.metadata, (update,))


Contributor

Let's also add some tests for RemoveSnapshotRefUpdate.

Contributor Author

Created a separate PR for remove-snapshot-ref and added a unit test there: #1598

@grihabor grihabor force-pushed the remove-snapshots-update branch from 28c6657 to 19e17f0 Compare February 1, 2025 09:45
@grihabor grihabor force-pushed the remove-snapshots-update branch from c8c63ea to 32e1e85 Compare February 1, 2025 09:53
Contributor Author

@grihabor grihabor left a comment

Thank you for the review! Could you explain which kind of integration tests you want? Something like a PySpark integration test with an expire_snapshots call?

Contributor

@Fokko Fokko left a comment

Left some small comments, but this looks great @grihabor


snapshots = [
    (s.model_copy(update={"parent_snapshot_id": None}) if s.parent_snapshot_id in update.snapshot_ids else s)
    for s in base_metadata.snapshots
Contributor

Can we make this a little more verbose:

Suggested change
- for s in base_metadata.snapshots
+ for snapshot in base_metadata.snapshots

Contributor Author

Sure, done in fb8f350

Fokko added a commit that referenced this pull request Feb 14, 2025
Part of #1561
Closes #822

---------

Co-authored-by: Fokko Driesprong <fokko@apache.org>
@kevinjqliu
Contributor

Now that #1598 is merged, @grihabor could you rebase onto main?

@grihabor
Contributor Author

@kevinjqliu Sure, merged with upstream main

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM!

@kevinjqliu kevinjqliu merged commit 19148d3 into apache:main Feb 17, 2025
7 checks passed