Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Complex Type Support to Signal Schema #422

Merged
merged 2 commits into from
Sep 12, 2024

Conversation

dtulga
Copy link
Contributor

@dtulga dtulga commented Sep 10, 2024

This adds complex type support to Signal Schema, including in model serialization / deserialization. This includes old-style List and Dict support, along with many other types, nested types, and custom type support as well. This also includes fallbacks for when custom types are not resolvable or available, and has been tested to serialize and deserialize mistral types as well (which use many strange and custom types in their classes).

This has been tested to work locally, and with Studio for both data display and image previews, even with custom or third-party types, including mistral types. Note that this was tested before many of the current breaking changes for Studio were introduced, and I will test this again in Studio once those are resolved. (The Studio test failures appear unrelated to this PR, and are also failing on main.) This is part of https://github.com/iterative/studio/issues/10601 and fixes #98

Copy link

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 97.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.17%. Comparing base (43f5b21) to head (a21c51a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/lib/signal_schema.py 97.72% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   87.09%   87.17%   +0.08%     
==========================================
  Files          92       92              
  Lines        9928     9975      +47     
  Branches     2032     2053      +21     
==========================================
+ Hits         8647     8696      +49     
+ Misses        927      926       -1     
+ Partials      354      353       -1     
Flag Coverage Δ
datachain 87.12% <97.77%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dtulga dtulga marked this pull request as ready for review September 10, 2024 23:57
@dtulga dtulga requested a review from a team September 10, 2024 23:58
Comment on lines +237 to +251
bracket_idx = type_name.find("[")
subtypes: Optional[tuple[Optional[type], ...]] = None
if bracket_idx > -1:
if bracket_idx == 0:
raise TypeError("Type cannot start with '['")
close_bracket_idx = type_name.rfind("]")
if close_bracket_idx == -1:
raise TypeError("Unclosed square bracket when parsing type")
if close_bracket_idx < bracket_idx:
raise TypeError("Square brackets are out of order when parsing type")
if close_bracket_idx == bracket_idx + 1:
raise TypeError("Empty square brackets when parsing type")
subtype_names = SignalSchema._split_subtypes(
type_name[bracket_idx + 1 : close_bracket_idx]
)
Copy link
Member

@skshetry skshetry Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should investigate how others do it (eg: pydantic).
This custom parsing looks very complicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can use eval or typing._eval_type()/typing.get_type_hints().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any pydantic-based string to type conversion, but please do mention if that is available somewhere. I did think of using eval but I'm not sure of the security implications / possibility for other problems from using the eval function. Also, this section of the code mostly just checks for syntax errors, as seen in the test test_resolve_types_errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these trusted data (coming from _custom_types)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents: https://peps.python.org/pep-0563/#resolving-type-hints-at-runtime
I’m sure you’re already aware of this PEP 👀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these trusted data (coming from _custom_types)?

It is also depends on where do we run this. If we run this only in DataChain (CLI or Studio workers) — it is absolutely OK to use eval, but if we are going to run this inside Studio backend, we should be very careful, because _custom_types can be easily updated by user with malformed data.

I am OK with this code, it is not covering all possible scenarios and errors might happens in case of bad data, but I think it works just fine with data generated by DataChain and I think we should not overcomplicate it.

Other option might be using regexps btw 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to eval even on Studio (they run inside a cluster). They are limited to types. (typing.get_type_hints() does exactly that).

_custom_types can be easily updated by user with malformed data.

We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).

but I think it works just fine with data generated by DataChain

DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to File models.

We are implementing our own parser (see _split_subtypes) for stringized annotations. And we are adding a lot of complexity in a already complicated module.

Other option might be using regexps btw

If we can avoid parsing this ourselves, that is my whole point.

Copy link
Contributor

@dreadatour dreadatour Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to eval even on Studio (they run inside a cluster).

No, I mean in Studio backend, not DataChain compute cluster. And that's my only concern against eval.

We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).

It is OK to do anything inside the cluster, but one can send malformed data into our Studio backend and it is not OK to run eval there. We can not even run checks with eval in API endpoint because it will be run inside Studio backend, and at the same time we can not trust to checks ran in DataChain cluster.

DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to File models.

We are implementing our own parser (see _split_subtypes) for stringized annotations. And we are adding a lot of complexity in a already complicated module.

I agree 😥

Other option might be using regexps btw

If we can avoid parsing this ourselves, that is my whole point.

Yes, this is not a replacement for this parser, this is just one more way to do this, worth mentioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signal Schema deserialization is run within the Studio backend (not just on DataChain workers), as seen here: https://github.com/iterative/studio/blob/2cca2ff0448ed96e2ac5a892dbb30eea09c4f524/backend/dqlapp/schema/types.py#L278 which is why I did not want to use anything like eval (or any functions that call eval as well) as this could be a security risk within the Studio backend.

As well, I initially tried regular expressions, but that seemed like it would be similar levels of complexity and also potentially harder to read / understand.

In addition, if there are any other error cases that I did not cover, feel free to mention and I'll be happy to add any necessary checks or unit tests for those as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dtulga ❤️ For me it looks good! 👍

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. Great job! 👍

Tested this PR on my localhost, and it solved the bug (see this comment).

Also studio tests are failing 😥 But it looks like it is not related to this PR, because it fails in also main branch (e.g. here).

Copy link

cloudflare-workers-and-pages bot commented Sep 11, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: a21c51a
Status: ✅  Deploy successful!
Preview URL: https://f1087568.datachain-documentation.pages.dev
Branch Preview URL: https://dtulga-signal-schema-complex.datachain-documentation.pages.dev

View logs

@dtulga dtulga merged commit d6a29df into main Sep 12, 2024
38 checks passed
@dtulga dtulga deleted the dtulga/signal-schema-complex-types branch September 12, 2024 01:46
dtulga added a commit that referenced this pull request Sep 12, 2024
@skshetry
Copy link
Member

I don't think there's any need for being paranoid (they are just types).

At least, I'd have loved to see other alternatives and research into other tools before getting this merged. Again, parsing type annotations is already complicated, and it's a moving target.
It's also hard to guarantee that this approach supports everything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SignalSchemaError when saving a dataset with a list of floats.
3 participants