Pydantic heaven #45
77d1851 to b912acc
from datachain.lib.feature_utils import pydantic_to_feature
-from datachain.lib.file import File, FileError, FileFeature, IndexedFile, TarVFile
+from datachain.lib.file import File, FileError, IndexedFile, TarVFile
from datachain.lib.image import ImageFile
Let's make sure it's consistent with #31. We need to make sure we don't import all cv libraries.
The image exports were moved to `src/datachain/image/__init__.py` and the text exports to `src/datachain/text/__init__.py`.
-FeatureType = Union[type["Feature"], FeatureStandardType]
FeatureTypeNames = "Feature, int, str, float, bool, list, dict, bytes, datetime"

+FeatureType = Union[type[BaseModel], FeatureStandardType]
Running the script at https://github.com/iterative/datachain/blob/860be4fe5ec9b00ee90b62c71c564650b0d39d1a/tests/scripts/feature_class_parallel.py still throws the error: the source code of the feature class cannot be located for extraction, because the script runs in dynamic mode and has no access to the source. So we cannot remove the workaround either.
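The general failure mode mentioned in this comment can be reproduced in isolation with the standard library. The sketch below is an illustration only, not the datachain code path: it defines a class through `exec`, so no source file backs it, and shows that `inspect.getsource` cannot recover the definition. The names `try_get_source`, `DynamicFeature`, and `dynamic_module` are hypothetical.

```python
import inspect
from typing import Optional


def try_get_source(cls: type) -> Optional[str]:
    """Return the source of a class, or None when Python cannot locate it."""
    try:
        return inspect.getsource(cls)
    except (OSError, TypeError):
        # Raised when no source file backs the class definition.
        return None


# Define a class dynamically: no file on disk contains this code.
namespace = {"__name__": "dynamic_module"}  # hypothetical module name
exec("class DynamicFeature:\n    value: int = 0\n", namespace)
DynamicFeature = namespace["DynamicFeature"]

# The class works normally, but its source text is not recoverable.
print(DynamicFeature.value, try_get_source(DynamicFeature))
```

Any approach that ships class definitions to workers by extracting their source text hits this limit for dynamically defined classes, which is consistent with the workaround staying in place.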
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   83.86%   84.03%   +0.16%
==========================================
  Files          91       91
  Lines        9479     9406      -73
  Branches     1855     1849       -6
==========================================
- Hits         7950     7904      -46
+ Misses       1211     1178      -33
- Partials      318      324       +6
@amritghimire thank you for catching this!
Deploying datachain-documentation
Latest commit: 9f62196
Status: ✅ Deploy successful!
Preview URL: https://609cbd05.datachain-documentation.pages.dev
Branch Preview URL: https://pydantic-heaven.datachain-documentation.pages.dev
from datachain import Column
from datachain.lib.dc import C, DataChain
Seems a bit redundant? Can we combine these two lines and pick either `C` or `Column`?
Sure!
with pd.option_context("display.max_columns", None):
    df = chain.to_pandas()
    print(df)
Very minor, but we should consider replacing all these statements in examples with `chain.show()`.
💯 We should do that once show() is merged.
src/datachain/lib/file.py
Outdated
@@ -281,7 +267,7 @@ def get_file_type(
    return get_file_type


-class IndexedFile(Feature):
+class IndexedFile(BaseModel):
Should it inherit from `DataModel`?
Or do you think it's unnecessary since it does not use `get_value`? Up to you.
Good catch! Thank you.
@@ -105,7 +105,7 @@ def __iter__(self) -> Iterator[Any]:
        for row_features in stream:
            row = []
            for fr in row_features:
-                if isinstance(fr, Feature):
+                if isinstance(fr, BaseModel):
I think it should either be checking for `DataModel` or we should check for `hasattr(fr, "get_value")`.
BaseModel should be enough since any pydantic class is supported now.
`get_value` has to be handled by the code that requests the value.
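The position taken in this thread can be sketched as follows: the row loop only checks for `BaseModel`, so any pydantic class passes, while `get_value` is resolved by the caller that wants the value. This is a minimal illustration under assumptions, not the datachain implementation: what happens inside the `isinstance` branch is not shown in the diff, so the flattening here is a guess, and `BBox`, `flatten_row`, and `requested_value` are hypothetical names.

```python
from typing import Any, Iterable

from pydantic import BaseModel


class BBox(BaseModel):  # hypothetical model, for illustration only
    x: int
    y: int


def flatten_row(row_features: Iterable[Any]) -> list:
    """Expand pydantic objects into their field values; keep other values."""
    row = []
    for fr in row_features:
        if isinstance(fr, BaseModel):
            # Iterating a pydantic model yields (field_name, value) pairs.
            row.extend(value for _, value in fr)
        else:
            row.append(fr)
    return row


def requested_value(obj: Any) -> Any:
    """Caller-side handling: use get_value() only when the object has it."""
    return obj.get_value() if hasattr(obj, "get_value") else obj


print(flatten_row(["img.jpg", BBox(x=1, y=2), 0.5]))
```

Keeping the `get_value` check out of the loop means the loop does not need to know about `DataModel` at all.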
src/datachain/lib/feature_utils.py
Outdated
    fields = {name: (anno, ...) for name, anno in data_dict.items()}
    return create_model(  # type: ignore[call-overload]
        name,
-        __base__=Feature,
+        __base__=BaseModel,
Can we drop this line since this is the default?
good point!
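A small sketch of the suggestion above: when `__base__` is omitted, pydantic's `create_model` derives the new class from `BaseModel`, so the explicit argument is redundant. The helper name `dict_to_model` and the `FileInfo` fields are hypothetical and only serve the example.

```python
from pydantic import BaseModel, create_model


def dict_to_model(name: str, data_dict: dict) -> type:
    """Build a pydantic model class from a {field_name: annotation} mapping."""
    # The (annotation, ...) form marks every field as required.
    fields = {field: (anno, ...) for field, anno in data_dict.items()}
    # No __base__ argument: pydantic falls back to BaseModel.
    return create_model(name, **fields)


FileInfo = dict_to_model("FileInfo", {"path": str, "size": int})
info = FileInfo(path="a.txt", size=3)
print(issubclass(FileInfo, BaseModel), info.size)
```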
@@ -265,22 +275,12 @@ def get_file_signals(self) -> Iterator[str]:
            if has_subtree and issubclass(type_, File):
                yield ".".join(path)

-    def create_model(self, name: str) -> type[Feature]:
Any reason not to keep it as a utility for exporting schema to pydantic?
Let's keep it just in case. Also, it looks like arrow converter uses it.
A few minor comments but overall LGTM!
from datachain.catalog import Catalog


class DataModel(BaseModel):
Should we just name it `Model`? Is `data` adding anything of value?
`Data` distinguishes it from ML models.
Addresses #43

- `Feature` class is not required anymore, just use pydantic's `BaseModel` ✨ `pydantic_to_feature()`
- `Feature` was renamed to `DataModel` and simplified: it is now a very lightweight class.
- `DataModel` (ex-`Feature`) is still preferred since it guarantees proper deserialization / `.from_dataset()`.
- A `pydantic.BaseModel` class might require registering before deserialization, like `Registry.add(MyPydanticClass)`. A few tricks were implemented to mitigate the issue:

In practice, this means you have a choice:
- `DataModel.register(MyClass)`

The latter is more convenient when you work with external pydantic classes such as Claude or Mistral. The first approach is easier to use.
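To illustrate why a plain pydantic class may need registering before deserialization: a stored record typically carries only a class name and field data, so something has to map that name back to a class. The sketch below is a hypothetical minimal registry, not datachain's `Registry` or `DataModel.register`; `ModelRegistry`, `restore`, and `Completion` are names invented for this example.

```python
from typing import Any, Dict, Type

from pydantic import BaseModel


class ModelRegistry:
    """Hypothetical name -> class lookup; not datachain's Registry."""

    _models: Dict[str, Type[BaseModel]] = {}

    @classmethod
    def add(cls, model: Type[BaseModel]) -> None:
        cls._models[model.__name__] = model

    @classmethod
    def restore(cls, name: str, data: Dict[str, Any]) -> BaseModel:
        # Without a registered class there is nothing to instantiate.
        if name not in cls._models:
            raise KeyError(f"{name} must be registered before deserialization")
        return cls._models[name](**data)


class Completion(BaseModel):  # stands in for an external pydantic class
    text: str
    tokens: int


ModelRegistry.add(Completion)
obj = ModelRegistry.restore("Completion", {"text": "hi", "tokens": 2})
print(type(obj).__name__, obj.tokens)
```

An explicit registration call suits classes you do not own, since their definitions cannot be changed to inherit from a project-specific base.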
ToDo
See Claude examples: