-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Custom Type (De)Serialization to Signal Schema #348
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #348 +/- ##
==========================================
- Coverage 86.89% 86.68% -0.22%
==========================================
Files 90 90
Lines 9923 9945 +22
Branches 2003 2009 +6
==========================================
- Hits 8623 8621 -2
- Misses 947 975 +28
+ Partials 353 349 -4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Deploying datachain-documentation with
|
Latest commit: |
03666b6
|
Status: | ✅ Deploy successful! |
Preview URL: | https://d75f7f95.datachain-documentation.pages.dev |
Branch Preview URL: | https://dtulga-signal-schema-custom.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me as long as it has been tested with Studio 👍
src/datachain/lib/signal_schema.py
Outdated
if ( | ||
not ModelStore.is_pydantic(fr) | ||
or not issubclass(fr, BaseModel) | ||
or name in ("File", "TextFile", "ImageFile") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be useful in future to have the list of these types somewhere as a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, although I'm not sure where to put them. Where do you think would be best for a "Built-In Feature Types" list? (As in, easy to maintain and not miss that these are listed somewhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/file.py comes to my mind first, where these models are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there much harm in dropping this condition and serializing custom fields even for those built-in types? That way, even if a future version were to drop or rename any of these, old datasets could still be read as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further consideration, I decided to just serialize those built-in models as well, since the default is to use the existing and defined model in the code first, this should be harmless unless they are renamed or deleted in the future, in which case having them serialized would be useful for reading old datasets.
src/datachain/lib/signal_schema.py
Outdated
@@ -69,6 +75,29 @@ def __init__(self, method: str, field): | |||
) | |||
|
|||
|
|||
dynamic_feature_models: dict[str, type[BaseModel]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be _dynamic_feature_models
(protected) or even __dynamic_feature_models
?
do we use it as a cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do this and custom_types
differ from the ModelStore
? Is it possible to reuse ModelStore
for either of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, changed it to _dynamic_feature_models
(protected) and yes, this is used as a cache.
This is intended more as a cache of the defined dynamic models, instead of the registered list of ModelStore
- as this only includes the created dynamic models, not all models. I'm not 100% clear on how ModelStore
is used and what caveats there would be with using it for this cache, especially since it does not have any mechanism for creating new models, only registering existing ones. I could look into integrating these two features together if you think it would be useful, but I'm not sure how complex that would be (and what side-effects there might be) and so it should probably be done in a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried just dropping _dynamic_feature_models
and seeing if it makes any difference? AFAIU because you use DataModel
as the base when creating the dynamic models, they should already be registered to the ModelStore
and cached and caught here before getting duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does make sense - I removed dynamic_feature_models
and renamed the function to create_feature_model
since it no longer caches entries and relies on the ModelStore.get
call to provide the cache. This should be fine, unless sometime in the future we need a list of dynamic models only (in which case something will have to be added to track these in ModelStore
or elsewhere).
@dtulga great change - neat and clean! I see there are already good comment inline please address these and merge. |
type_name, fr_type = SignalSchema._get_name_original_type(fr_type) | ||
signals[name] = type_name | ||
self.serialize_custom_model_fields(type_name, fr_type, custom_types) | ||
if custom_types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something, or is custom_types
always empty here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be an attribute of SignalSchema
? It looks like a local variable but doesn't seem to be used like one AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The custom_types
dict is set within serialize_custom_model_fields
, see above for example: https://github.com/iterative/datachain/pull/348/files/2a86ad17b910d39e197ea24e3fd4a507b62249d0#diff-69a1fc2d77ff106392ad7649ed8cf7ec344dad230af2e4d0ea9c64b816cee15bR200
I didn't want to just make this a return-based set as it would lead to extra code complexity (due to the recursive nature of serialize_custom_model_fields
) and I didn't want to make this an attribute of SignalSchema
as this information is only valid during the call to serialize
and may become out of date or incorrect at any other times (and is not used at any other times).
This adds the ability to save and restore model classes (feature classes) to SignalSchema, by saving these in a
_custom_types
key, then restoring them by creating a dynamic pydantic class with the stored definition. This is useful for when a custom model class is declared and used to create a dataset, but then this dataset is viewed or used in a DataChain somewhere else where this original definition is not available. This will also be useful for any other import / export functions where the original types may need to be changed, such as in parquet import/export, although this will be included in a followup/separate PR(s) to avoid this getting too complex and keep these PRs focused.This has been tested locally and with Studio. This fixes https://github.com/iterative/studio/issues/10417 and is part of #91