
Pydantic heaven #45

Merged: 23 commits from pydantic_heaven into main, Jul 16, 2024
Conversation

@dmpetrov (Member) commented Jul 14, 2024

Addresses #43

  1. The Feature class is not required anymore; just use pydantic's BaseModel ✨
  2. Dynamic classes are not needed anymore; pydantic_to_feature() was removed
  3. The parallel compute issue was solved as a result of (2)
  4. Feature was renamed to DataModel and simplified: it's now a very lightweight class.
  5. Note: inheriting from DataModel (ex-Feature) is still preferred since it guarantees proper deserialization / .from_dataset(). A plain pydantic.BaseModel class might require registering before deserialization, e.g. Registry.add(MyPydanticClass). A few tricks were implemented to mitigate the issue:
    • map/gen/agg auto-register all output types
    • save() does the same

In practice, this means you have a choice:

  • keep inheriting your classes from DataModel (ex-Feature)
  • use pydantic's BaseModel, but don't forget to register it with DataModel.register(MyClass)

The latter is more convenient when you work with external pydantic classes such as Claude's or Mistral's. The first approach is easier to use.
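A minimal sketch of the two options, assuming the import path from this PR's layout (datachain.lib.data_model) and the DataModel.register() call mentioned above; class names and fields are illustrative:

from pydantic import BaseModel

from datachain.lib.data_model import DataModel  # import path assumed from this PR's layout


# Option 1: inherit from DataModel (ex-Feature); deserialization / .from_dataset()
# works without extra registration.
class MyAnnotation(DataModel):
    label: str
    score: float


# Option 2: keep a plain pydantic BaseModel (e.g. an external class),
# but register it so it can be deserialized later.
class ExternalAnnotation(BaseModel):
    label: str
    score: float


DataModel.register(ExternalAnnotation)

Registration is only needed for the second option; as noted above, map/gen/agg and save() auto-register output types, so explicit registration mostly matters when reading datasets back.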

ToDo

  • Modify Claude examples to this new API
  • Refactoring: completely remove Feature class, rename Feature tool/util classes and file
  • Rename Feature to DataModel

See Claude examples:

import anthropic
from anthropic.types import Message  # This class is used directly without any conversion

# Assumed imports for the rest of the snippet (paths as used elsewhere in this PR);
# DATA, API_KEY, MODEL, PROMPT, TEMPERATURE and DEFAULT_OUTPUT_TOKENS are defined
# elsewhere in the example script.
from datachain import Column
from datachain.lib.dc import DataChain
from datachain.lib.file import File

chain = (
    DataChain.from_storage(DATA, type="text")
    .filter(Column("file.name").glob("*.txt"))
    # .limit(5)
    .settings(parallel=4, cache=True)
    .setup(client=lambda: anthropic.Anthropic(api_key=API_KEY))
    .map(
        claude=lambda client, file: client.messages.create(
            model=MODEL,
            system=PROMPT,
            messages=[
                {
                    "role": "user",
                    "content": file.get_value() if isinstance(file, File) else file,
                },
            ],
            temperature=TEMPERATURE,
            max_tokens=DEFAULT_OUTPUT_TOKENS,
        ),
        output=Message,
    )
)

@dmpetrov dmpetrov marked this pull request as draft July 14, 2024 20:14
@dmpetrov dmpetrov marked this pull request as ready for review July 15, 2024 07:42
-from datachain.lib.feature_utils import pydantic_to_feature
-from datachain.lib.file import File, FileError, FileFeature, IndexedFile, TarVFile
+from datachain.lib.file import File, FileError, IndexedFile, TarVFile
 from datachain.lib.image import ImageFile
Contributor:

Let's make sure it's consistent with #31. We need to make sure we don't import all cv libraries.

Contributor:

The image exports were moved to src/datachain/image/__init__.py and text to src/datachain/text/__init__.py

-FeatureType = Union[type["Feature"], FeatureStandardType]
 FeatureTypeNames = "Feature, int, str, float, bool, list, dict, bytes, datetime"

+FeatureType = Union[type[BaseModel], FeatureStandardType]

This comment was marked as outdated.

@amritghimire (Contributor) commented:

Running the script at https://github.com/iterative/datachain/blob/860be4fe5ec9b00ee90b62c71c564650b0d39d1a/tests/scripts/feature_class_parallel.py:

  • datachain query <query_script.py>

    still throws the error, because the source code of the feature class cannot be identified for extraction: the query runs in dynamic mode and doesn't have access to the source code. We cannot remove the workaround either.

  • python <query_script.py> works, but we cannot remove the workaround of identifying the feature class and moving it to a separate temporary file. That would mean removing the context manager here:

    with self.process_feature_module():

    and/or probably removing

    feature_import = ast.ImportFrom(
        module=feature_module_name,
        names=[ast.alias(name="*", asname=None)],
        level=0,
    )
    feature_module = form_module_source([*finder.imports, *finder.feature_class])
    main_module = form_module_source(
        [*finder.imports, feature_import, *finder.main_body]
    )
    return feature_module, main_module

    and instead returning the main AST outright.
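For illustration only, here is a minimal, self-contained sketch of the kind of AST splitting this workaround performs. It is not the actual datachain implementation (finder and form_module_source above belong to the real code); it only approximates the idea of separating class definitions into their own module and importing them back:

import ast


def split_script(source: str, feature_module_name: str) -> tuple[str, str]:
    # Split a query script into (feature module, main module): top-level class
    # definitions go to a separate module, and the main body imports them back
    # with "from <feature_module_name> import *".  Requires Python 3.9+ (ast.unparse).
    tree = ast.parse(source)
    imports = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]
    main_body = [n for n in tree.body if n not in imports and n not in classes]

    feature_import = ast.ImportFrom(
        module=feature_module_name,
        names=[ast.alias(name="*", asname=None)],
        level=0,
    )

    feature_module = ast.unparse(ast.Module(body=[*imports, *classes], type_ignores=[]))
    main_module = ast.unparse(
        ast.Module(body=[*imports, feature_import, *main_body], type_ignores=[])
    )
    return feature_module, main_module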


codecov bot commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 88.93617% with 26 lines in your changes missing coverage. Please review.

Project coverage is 84.03%. Comparing base (0f9ed69) to head (9f62196).

Files                                 Patch %   Lines
src/datachain/lib/signal_schema.py    73.17%    8 Missing and 3 partials ⚠️
src/datachain/lib/feature.py          90.62%    4 Missing and 5 partials ⚠️
src/datachain/lib/data_model.py       85.18%    3 Missing and 1 partial ⚠️
src/datachain/lib/udf.py              91.66%    0 Missing and 1 partial ⚠️
src/datachain/lib/udf_signature.py    66.66%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   83.86%   84.03%   +0.16%     
==========================================
  Files          91       91              
  Lines        9479     9406      -73     
  Branches     1855     1849       -6     
==========================================
- Hits         7950     7904      -46     
+ Misses       1211     1178      -33     
- Partials      318      324       +6     
Flag        Coverage Δ
datachain   83.96% <88.93%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dmpetrov (Member, Author) commented:

  • datachain query <query_script.py>

@amritghimire thank you for catching this!
It should be solved as a follow-up item. We are not exposing any commands anyway (#54).


cloudflare-workers-and-pages bot commented Jul 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 9f62196
Status: ✅  Deploy successful!
Preview URL: https://609cbd05.datachain-documentation.pages.dev
Branch Preview URL: https://pydantic-heaven.datachain-documentation.pages.dev

View logs

Comment on lines 7 to 8
from datachain import Column
from datachain.lib.dc import C, DataChain
Contributor:

Seems a bit redundant? Can we combine these two lines and pick one of either C or Column?

Member Author:

Sure!

Comment on lines 59 to 61
with pd.option_context("display.max_columns", None):
    df = chain.to_pandas()
    print(df)
Contributor:

Very minor, but we should consider replacing all these statements in examples with chain.show().

Member Author:

💯 We should do that once show() is merged.

@@ -281,7 +267,7 @@ def get_file_type(
     return get_file_type


-class IndexedFile(Feature):
+class IndexedFile(BaseModel):
Contributor:

Should it inherit from DataModel?

Contributor:

Or do you think it's unnecessary since it does not use get_value? Up to you.

Member Author:

Good catch! Thank you.
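A sketch of the change agreed above, assuming the import paths from this PR's layout; the fields shown are illustrative, not a definitive definition:

from datachain.lib.data_model import DataModel  # path assumed from this PR
from datachain.lib.file import File


# IndexedFile inherits from DataModel (ex-Feature) instead of plain BaseModel,
# so it participates in registration/deserialization out of the box.
class IndexedFile(DataModel):
    file: File
    index: int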

@@ -105,7 +105,7 @@ def __iter__(self) -> Iterator[Any]:
         for row_features in stream:
             row = []
             for fr in row_features:
-                if isinstance(fr, Feature):
+                if isinstance(fr, BaseModel):
Contributor:

I think it should either be checking for DataModel or we should check for hasattr(fr, "get_value").

Member Author:

BaseModel should be enough since any pydantic class is supported now.

Member Author:

get_value has to be handled by the code that requests the value.
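For reference, a tiny sketch of what "handled by the code that requests the value" could look like on the caller's side; this is illustrative only, not the actual UDF code:

from typing import Any


def resolve_value(fr: Any) -> Any:
    # The caller decides whether the object exposes get_value() (as File does)
    # or is a plain pydantic BaseModel that should be passed through as-is.
    return fr.get_value() if hasattr(fr, "get_value") else fr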

     fields = {name: (anno, ...) for name, anno in data_dict.items()}
     return create_model(  # type: ignore[call-overload]
         name,
-        __base__=Feature,
+        __base__=BaseModel,
@dberenbaum (Contributor) commented Jul 16, 2024:

Can we drop this line since this is the default?

Member Author:

good point!
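For reference, a small sketch of the point above: pydantic's create_model() already defaults to BaseModel when no __base__ is given (the model name and fields here are illustrative):

from pydantic import BaseModel, create_model

# No __base__ argument: create_model() uses BaseModel as the base class by default,
# so passing __base__=BaseModel explicitly is redundant.
Thing = create_model("Thing", name=(str, ...), size=(int, 0))
assert issubclass(Thing, BaseModel)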

@@ -265,22 +275,12 @@ def get_file_signals(self) -> Iterator[str]:
         if has_subtree and issubclass(type_, File):
             yield ".".join(path)

     def create_model(self, name: str) -> type[Feature]:
Contributor:

Any reason not to keep it as a utility for exporting schema to pydantic?

Member Author:

Let's keep it just in case. Also, it looks like the arrow converter uses it.

@dberenbaum (Contributor) left a comment:

A few minor comments but overall LGTM!

from datachain.catalog import Catalog


class DataModel(BaseModel):
Member:

Should we just name it Model? Is data adding anything of value?

Member Author:

Data distinguishes it from ML models.

@dmpetrov dmpetrov merged commit 6c0fdba into main Jul 16, 2024
19 checks passed
@dmpetrov dmpetrov deleted the pydantic_heaven branch July 16, 2024 16:50