
Pydantic heaven #45

Merged: 23 commits from pydantic_heaven into main, Jul 16, 2024
Conversation

@dmpetrov (Member) commented Jul 14, 2024

Addresses #43

  1. The Feature class is not required anymore; just use pydantic's BaseModel ✨
  2. Dynamic classes are not needed anymore; pydantic_to_feature() was removed
  3. The parallel compute issue was solved as a result of (2)
  4. Feature was renamed to DataModel and simplified: it's now a very lightweight class.
  5. Note: inheriting from DataModel (ex-Feature) is still preferred since it guarantees proper deserialization / .from_dataset(). A plain pydantic.BaseModel class might require registering before deserialization, e.g. Registry.add(MyPydanticClass). A few tricks were implemented to mitigate the issue:
    • map/gen/agg auto-register all output types
    • save() does the same

In practice, this means you have a choice:

  • keep inheriting your classes from DataModel (ex-Feature)
  • use pydantic's BaseModel, but don't forget to register it with DataModel.register(MyClass)

The latter is more convenient when you work with external pydantic classes such as Claude's or Mistral's. The first approach is easier to use.
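A minimal sketch of the two options, assuming the import path from this PR's layout (datachain.lib.data_model) and the DataModel.register() call mentioned above; class names and fields are illustrative:

from pydantic import BaseModel

from datachain.lib.data_model import DataModel  # import path assumed from this PR's layout


# Option 1: inherit from DataModel (ex-Feature); deserialization / .from_dataset()
# works without extra registration.
class MyAnnotation(DataModel):
    label: str
    score: float


# Option 2: keep a plain pydantic BaseModel (e.g. an external class),
# but register it so it can be deserialized later.
class ExternalAnnotation(BaseModel):
    label: str
    score: float


DataModel.register(ExternalAnnotation)

Registration is only needed for the second option; as noted above, map/gen/agg and save() auto-register output types, so explicit registration mostly matters when reading datasets back.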

ToDo

  • Modify Claude examples to this new API
  • Refactoring: completely remove Feature class, rename Feature tool/util classes and file
  • Rename Feature to DataModel

See Claude examples:

import anthropic
from anthropic.types import Message  # This class is used directly without any conversion

# Assumed imports for the rest of the snippet (paths as used elsewhere in this PR);
# DATA, API_KEY, MODEL, PROMPT, TEMPERATURE and DEFAULT_OUTPUT_TOKENS are defined
# elsewhere in the example script.
from datachain import Column
from datachain.lib.dc import DataChain
from datachain.lib.file import File

chain = (
    DataChain.from_storage(DATA, type="text")
    .filter(Column("file.name").glob("*.txt"))
    # .limit(5)
    .settings(parallel=4, cache=True)
    .setup(client=lambda: anthropic.Anthropic(api_key=API_KEY))
    .map(
        claude=lambda client, file: client.messages.create(
            model=MODEL,
            system=PROMPT,
            messages=[
                {
                    "role": "user",
                    "content": file.get_value() if isinstance(file, File) else file,
                },
            ],
            temperature=TEMPERATURE,
            max_tokens=DEFAULT_OUTPUT_TOKENS,
        ),
        output=Message,
    )
)

@dmpetrov dmpetrov marked this pull request as draft July 14, 2024 20:14
@dmpetrov dmpetrov marked this pull request as ready for review July 15, 2024 07:42
-from datachain.lib.feature_utils import pydantic_to_feature
-from datachain.lib.file import File, FileError, FileFeature, IndexedFile, TarVFile
+from datachain.lib.file import File, FileError, IndexedFile, TarVFile
 from datachain.lib.image import ImageFile
Contributor:

Let's make sure it's consistent with #31. We need to make sure we don't import all cv libraries.

Contributor:

The image exports were moved to src/datachain/image/__init__.py and text to src/datachain/text/__init__.py

-FeatureType = Union[type["Feature"], FeatureStandardType]
 FeatureTypeNames = "Feature, int, str, float, bool, list, dict, bytes, datetime"

+FeatureType = Union[type[BaseModel], FeatureStandardType]

This comment was marked as outdated.

@amritghimire (Contributor) commented:

Running the script at https://github.com/iterative/datachain/blob/860be4fe5ec9b00ee90b62c71c564650b0d39d1a/tests/scripts/feature_class_parallel.py:

  • datachain query <query_script.py>

    still throws the error, because the source code of the feature class cannot be identified for extraction: the query runs in dynamic mode and doesn't have access to the source code. We cannot remove the workaround either.

  • python <query_script.py> works, but we cannot remove the workaround of identifying the feature class and moving it to a separate temporary file. That would mean removing the context manager here:

    with self.process_feature_module():

    and/or probably removing

    feature_import = ast.ImportFrom(
        module=feature_module_name,
        names=[ast.alias(name="*", asname=None)],
        level=0,
    )
    feature_module = form_module_source([*finder.imports, *finder.feature_class])
    main_module = form_module_source(
        [*finder.imports, feature_import, *finder.main_body]
    )
    return feature_module, main_module

    and instead returning the main AST outright.
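For illustration only, here is a minimal, self-contained sketch of the kind of AST splitting this workaround performs. It is not the actual datachain implementation (finder and form_module_source above belong to the real code); it only approximates the idea of separating class definitions into their own module and importing them back:

import ast


def split_script(source: str, feature_module_name: str) -> tuple[str, str]:
    # Split a query script into (feature module, main module): top-level class
    # definitions go to a separate module, and the main body imports them back
    # with "from <feature_module_name> import *".  Requires Python 3.9+ (ast.unparse).
    tree = ast.parse(source)
    imports = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]
    main_body = [n for n in tree.body if n not in imports and n not in classes]

    feature_import = ast.ImportFrom(
        module=feature_module_name,
        names=[ast.alias(name="*", asname=None)],
        level=0,
    )

    feature_module = ast.unparse(ast.Module(body=[*imports, *classes], type_ignores=[]))
    main_module = ast.unparse(
        ast.Module(body=[*imports, feature_import, *main_body], type_ignores=[])
    )
    return feature_module, main_module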


codecov bot commented Jul 15, 2024

Codecov Report

Attention: Patch coverage is 88.93617% with 26 lines in your changes missing coverage. Please review.

Project coverage is 84.03%. Comparing base (0f9ed69) to head (9f62196).

Files                                 Patch %   Lines
src/datachain/lib/signal_schema.py    73.17%    8 Missing and 3 partials ⚠️
src/datachain/lib/feature.py          90.62%    4 Missing and 5 partials ⚠️
src/datachain/lib/data_model.py       85.18%    3 Missing and 1 partial ⚠️
src/datachain/lib/udf.py              91.66%    0 Missing and 1 partial ⚠️
src/datachain/lib/udf_signature.py    66.66%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   83.86%   84.03%   +0.16%     
==========================================
  Files          91       91              
  Lines        9479     9406      -73     
  Branches     1855     1849       -6     
==========================================
- Hits         7950     7904      -46     
+ Misses       1211     1178      -33     
- Partials      318      324       +6     
Flag        Coverage Δ
datachain   83.96% <88.93%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dmpetrov (Member, Author) commented:

  • datachain query <query_script.py>

@amritghimire thank you for catching this!
It should be solved as a follow-up item. We are not exposing any commands anyway (#54).


cloudflare-workers-and-pages bot commented Jul 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 9f62196
Status: ✅  Deploy successful!
Preview URL: https://609cbd05.datachain-documentation.pages.dev
Branch Preview URL: https://pydantic-heaven.datachain-documentation.pages.dev

View logs

Comment on lines 7 to 8
from datachain import Column
from datachain.lib.dc import C, DataChain
Contributor:

Seems a bit redundant? Can we combine these two lines and pick one of either C or Column?

Member Author:

Sure!

Comment on lines 59 to 61
with pd.option_context("display.max_columns", None):
    df = chain.to_pandas()
    print(df)
Contributor:

Very minor, but we should consider replacing all these statements in examples with chain.show().

Member Author:

💯 We should do that once show() is merged.

@@ -281,7 +267,7 @@ def get_file_type(
     return get_file_type


-class IndexedFile(Feature):
+class IndexedFile(BaseModel):
Contributor:

Should it inherit from DataModel?

Contributor:

Or do you think it's unnecessary since it does not use get_value? Up to you.

Member Author:

Good catch! Thank you.
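A sketch of the change agreed above, assuming the import paths from this PR's layout; the fields shown are illustrative, not a definitive definition:

from datachain.lib.data_model import DataModel  # path assumed from this PR
from datachain.lib.file import File


# IndexedFile inherits from DataModel (ex-Feature) instead of plain BaseModel,
# so it participates in registration/deserialization out of the box.
class IndexedFile(DataModel):
    file: File
    index: int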

@@ -105,7 +105,7 @@ def __iter__(self) -> Iterator[Any]:
         for row_features in stream:
             row = []
             for fr in row_features:
-                if isinstance(fr, Feature):
+                if isinstance(fr, BaseModel):
Contributor:

I think it should either be checking for DataModel or we should check for hasattr(fr, "get_value").

Member Author:

BaseModel should be enough since any pydantic class is supported now.

Member Author:

get_value has to be handled by the code that requests the value.
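For reference, a tiny sketch of what "handled by the code that requests the value" could look like on the caller's side; this is illustrative only, not the actual UDF code:

from typing import Any


def resolve_value(fr: Any) -> Any:
    # The caller decides whether the object exposes get_value() (as File does)
    # or is a plain pydantic BaseModel that should be passed through as-is.
    return fr.get_value() if hasattr(fr, "get_value") else fr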

     fields = {name: (anno, ...) for name, anno in data_dict.items()}
     return create_model(  # type: ignore[call-overload]
         name,
-        __base__=Feature,
+        __base__=BaseModel,
@dberenbaum (Contributor) commented Jul 16, 2024:

Can we drop this line since this is the default?

Member Author:

good point!
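For reference, a small sketch of the point above: pydantic's create_model() already defaults to BaseModel when no __base__ is given (the model name and fields here are illustrative):

from pydantic import BaseModel, create_model

# No __base__ argument: create_model() uses BaseModel as the base class by default,
# so passing __base__=BaseModel explicitly is redundant.
Thing = create_model("Thing", name=(str, ...), size=(int, 0))
assert issubclass(Thing, BaseModel)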

@@ -265,22 +275,12 @@ def get_file_signals(self) -> Iterator[str]:
         if has_subtree and issubclass(type_, File):
             yield ".".join(path)

     def create_model(self, name: str) -> type[Feature]:
Contributor:

Any reason not to keep it as a utility for exporting schema to pydantic?

Member Author:

Let's keep it just in case. Also, it looks like the arrow converter uses it.

@dberenbaum (Contributor) left a comment:

A few minor comments but overall LGTM!

from datachain.catalog import Catalog


class DataModel(BaseModel):
Member:

Should we just name it Model? Is data adding anything of value?

Member Author:

Data distinguishes it from ML models.

@dmpetrov dmpetrov merged commit 6c0fdba into main Jul 16, 2024
19 checks passed
@dmpetrov dmpetrov deleted the pydantic_heaven branch July 16, 2024 16:50