Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Custom Type (De)Serialization to Signal Schema #348

Merged
merged 5 commits into from
Aug 27, 2024

Conversation

dtulga
Copy link
Contributor

@dtulga dtulga commented Aug 23, 2024

This adds the ability to save and restore model classes (feature classes) to SignalSchema, by saving these in a _custom_types key, then restoring them by creating a dynamic pydantic class with the stored definition. This is useful for when a custom model class is declared and used to create a dataset, but then this dataset is viewed or used in a DataChain somewhere else where this original definition is not available. This will also be useful for any other import / export functions where the original types may need to be changed, such as in parquet import/export, although this will be included in a followup/separate PR(s) to avoid this getting too complex and keep these PRs focused.

This has been tested locally and with Studio. This fixes https://github.com/iterative/studio/issues/10417 and is part of #91

@dtulga dtulga self-assigned this Aug 23, 2024
Copy link

codecov bot commented Aug 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.68%. Comparing base (f0db01d) to head (03666b6).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #348      +/-   ##
==========================================
- Coverage   86.89%   86.68%   -0.22%     
==========================================
  Files          90       90              
  Lines        9923     9945      +22     
  Branches     2003     2009       +6     
==========================================
- Hits         8623     8621       -2     
- Misses        947      975      +28     
+ Partials      353      349       -4     
Flag Coverage Δ
datachain 86.61% <100.00%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

cloudflare-workers-and-pages bot commented Aug 24, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 03666b6
Status: ✅  Deploy successful!
Preview URL: https://d75f7f95.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-signal-schema-custom.datachain-documentation.pages.dev

View logs

@dtulga dtulga marked this pull request as ready for review August 24, 2024 01:24
@dtulga dtulga requested a review from a team August 24, 2024 01:25
Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me as long as it has been tested with Studio 👍

if (
not ModelStore.is_pydantic(fr)
or not issubclass(fr, BaseModel)
or name in ("File", "TextFile", "ImageFile")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be useful in future to have the list of these types somewhere as a constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, although I'm not sure where to put them. Where do you think would be best for a "Built-In Feature Types" list? (As in, easy to maintain and not miss that these are listed somewhere.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib/file.py comes to my mind first, where these models are.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there much harm in dropping this condition and serializing custom fields even for those built-in types? That way, even if a future version were to drop or rename any of these, old datasets could still be read as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further consideration, I decided to just serialize those built-in models as well, since the default is to use the existing and defined model in the code first, this should be harmless unless they are renamed or deleted in the future, in which case having them serialized would be useful for reading old datasets.

@@ -69,6 +75,29 @@ def __init__(self, method: str, field):
)


dynamic_feature_models: dict[str, type[BaseModel]] = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be _dynamic_feature_models (protected) or even __dynamic_feature_models?

do we use it as a cache?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do this and custom_types differ from the ModelStore? Is it possible to reuse ModelStore for either of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changed it to _dynamic_feature_models (protected) and yes, this is used as a cache.

This is intended more as a cache of the defined dynamic models, instead of the registered list of ModelStore - as this only includes the created dynamic models, not all models. I'm not 100% clear on how ModelStore is used and what caveats there would be with using it for this cache, especially since it does not have any mechanism for creating new models, only registering existing ones. I could look into integrating these two features together if you think it would be useful, but I'm not sure how complex that would be (and what side-effects there might be) and so it should probably be done in a followup.

Copy link
Contributor

@dberenbaum dberenbaum Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried just dropping _dynamic_feature_models and seeing if it makes any difference? AFAIU because you use DataModel as the base when creating the dynamic models, they should already be registered to the ModelStore and cached and caught here before getting duplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does make sense - I removed dynamic_feature_models and renamed the function to create_feature_model since it no longer caches entries and relies on the ModelStore.get call to provide the cache. This should be fine, unless sometime in the future we need a list of dynamic models only (in which case something will have to be added to track these in ModelStore or elsewhere).

@dmpetrov
Copy link
Member

@dtulga great change - neat and clean! I see there are already good comment inline please address these and merge.

type_name, fr_type = SignalSchema._get_name_original_type(fr_type)
signals[name] = type_name
self.serialize_custom_model_fields(type_name, fr_type, custom_types)
if custom_types:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something, or is custom_types always empty here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be an attribute of SignalSchema? It looks like a local variable but doesn't seem to be used like one AFAICT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The custom_types dict is set within serialize_custom_model_fields, see above for example: https://github.com/iterative/datachain/pull/348/files/2a86ad17b910d39e197ea24e3fd4a507b62249d0#diff-69a1fc2d77ff106392ad7649ed8cf7ec344dad230af2e4d0ea9c64b816cee15bR200

I didn't want to just make this a return-based set as it would lead to extra code complexity (due to the recursive nature of serialize_custom_model_fields) and I didn't want to make this an attribute of SignalSchema as this information is only valid during the call to serialize and may become out of date or incorrect at any other times (and is not used at any other times).

@dtulga dtulga merged commit 49aec42 into main Aug 27, 2024
38 checks passed
@dtulga dtulga deleted the dtulga/signal-schema-custom-types branch August 27, 2024 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants