Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Pydantic's model_copy for model modification when updating table metadata #182

Merged
merged 31 commits into from
Dec 6, 2023

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Dec 5, 2023

Fixes #179

This PR uses Pydantic's model_copy to apply table updates to metadata. Specifically:

  • When applying AddSchema, SetCurrentSchema, AddSnapshot, SetSnapshotRef, we use base_metadata.model_copy(update=...) to get the updated metadata.
  • When applying UpgradeFormatVersion, we still need to rebuild the whole model from TableMetadataV1 to TableMetadataV2

@Fokko Could you please take a look at this when you have a chance? Thanks!

HonahX added 29 commits November 5, 2023 00:32
# Conflicts:
#	pyiceberg/table/__init__.py
#	tests/conftest.py
# Conflicts:
#	tests/conftest.py
…able_metadata_update_model_copy

# Conflicts:
#	pyiceberg/table/__init__.py
# Conflicts:
#	pyiceberg/table/__init__.py
#	tests/table/test_init.py
@@ -533,6 +535,8 @@ def update_table_metadata(base_metadata: TableMetadata, updates: Tuple[TableUpda
for update in updates:
new_metadata = _apply_table_update(update, new_metadata, context)

# Rebuild metadata to trigger validation
new_metadata = TableMetadataUtil.parse_obj(copy(new_metadata.model_dump()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since model_copy performs a shallow copy by default, I believe we need to execute a deep copy before returning the final new_metadata. Otherwise, base_metadata might be inadvertently altered due to any improper updates applied to new_metadata subsequently.

Furthermore, as indicated by pydantic/pydantic#418 and by some local tests, model_copy(update=) does not validate the contents of update. I think it might be good to reconstruct the metadata at this point to initiate the validation process.

(Alternatively, we could perform a deep copy here and incorporate the validation into our unit tests. Open to discussion on this approach.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're okay to skip the validation on the Pydantic side, we should validate the input anyway to generate a more meaningful error.

We could set it to deep-copy using the deep=True kwarg. Doing this in Pydantic is probably an order of magnitude more efficient than doing this in Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I changed to model_copy(deep=True) here and moved the validation to a unit test

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @HonahX! Thanks for fixing this right away.

One thing I wondered if it is better to do a deep-copy when returning the final metadata, or just deep-copy in the steps where we update the metadata. But I think the current approach is more efficient when there are several updates.

@Fokko Fokko merged commit a368bd9 into apache:main Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid dictionary (de)serialization for model modification
2 participants