-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Complex Type Support to Signal Schema #422
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #422 +/- ##
==========================================
+ Coverage 87.09% 87.17% +0.08%
==========================================
Files 92 92
Lines 9928 9975 +47
Branches 2032 2053 +21
==========================================
+ Hits 8647 8696 +49
+ Misses 927 926 -1
+ Partials 354 353 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
bracket_idx = type_name.find("[") | ||
subtypes: Optional[tuple[Optional[type], ...]] = None | ||
if bracket_idx > -1: | ||
if bracket_idx == 0: | ||
raise TypeError("Type cannot start with '['") | ||
close_bracket_idx = type_name.rfind("]") | ||
if close_bracket_idx == -1: | ||
raise TypeError("Unclosed square bracket when parsing type") | ||
if close_bracket_idx < bracket_idx: | ||
raise TypeError("Square brackets are out of order when parsing type") | ||
if close_bracket_idx == bracket_idx + 1: | ||
raise TypeError("Empty square brackets when parsing type") | ||
subtype_names = SignalSchema._split_subtypes( | ||
type_name[bracket_idx + 1 : close_bracket_idx] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should investigate how others do it (eg: pydantic
).
This custom parsing looks very complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can use eval
or typing._eval_type()
/typing.get_type_hints()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any pydantic-based string to type conversion, but please do mention if that is available somewhere. I did think of using eval
but I'm not sure of the security implications / possibility for other problems from using the eval
function. Also, this section of the code mostly just checks for syntax errors, as seen in the test test_resolve_types_errors
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these trusted data (coming from _custom_types
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents: https://peps.python.org/pep-0563/#resolving-type-hints-at-runtime
I’m sure you’re already aware of this PEP 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these trusted data (coming from _custom_types)?
It is also depends on where do we run this. If we run this only in DataChain (CLI or Studio workers) — it is absolutely OK to use eval
, but if we are going to run this inside Studio backend, we should be very careful, because _custom_types
can be easily updated by user with malformed data.
I am OK with this code, it is not covering all possible scenarios and errors might happens in case of bad data, but I think it works just fine with data generated by DataChain and I think we should not overcomplicate it.
Other option might be using regexps btw 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay to eval
even on Studio (they run inside a cluster). They are limited to types. (typing.get_type_hints()
does exactly that).
_custom_types
can be easily updated by user with malformed data.
We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).
but I think it works just fine with data generated by DataChain
DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to File
models.
We are implementing our own parser (see _split_subtypes
) for stringized annotations. And we are adding a lot of complexity in a already complicated module.
Other option might be using regexps btw
If we can avoid parsing this ourselves, that is my whole point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay to
eval
even on Studio (they run inside a cluster).
No, I mean in Studio backend, not DataChain compute cluster. And that's my only concern against eval
.
We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).
It is OK to do anything inside the cluster, but one can send malformed data into our Studio backend and it is not OK to run eval
there. We can not even run checks with eval
in API endpoint because it will be run inside Studio backend, and at the same time we can not trust to checks ran in DataChain cluster.
DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to
File
models.We are implementing our own parser (see
_split_subtypes
) for stringized annotations. And we are adding a lot of complexity in a already complicated module.
I agree 😥
Other option might be using regexps btw
If we can avoid parsing this ourselves, that is my whole point.
Yes, this is not a replacement for this parser, this is just one more way to do this, worth mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signal Schema deserialization is run within the Studio backend (not just on DataChain workers), as seen here: https://github.com/iterative/studio/blob/2cca2ff0448ed96e2ac5a892dbb30eea09c4f524/backend/dqlapp/schema/types.py#L278 which is why I did not want to use anything like eval
(or any functions that call eval
as well) as this could be a security risk within the Studio backend.
As well, I initially tried regular expressions, but that seemed like it would be similar levels of complexity and also potentially harder to read / understand.
In addition, if there are any other error cases that I did not cover, feel free to mention and I'll be happy to add any necessary checks or unit tests for those as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dtulga ❤️ For me it looks good! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall. Great job! 👍
Tested this PR on my localhost, and it solved the bug (see this comment).
Also studio tests are failing 😥 But it looks like it is not related to this PR, because it fails in also main
branch (e.g. here).
Deploying datachain-documentation with Cloudflare Pages
|
I don't think there's any need for being paranoid (they are just types). At least, I'd have loved to see other alternatives and research into other tools before getting this merged. Again, parsing type annotations is already complicated, and it's a moving target. |
This adds complex type support to Signal Schema, including in model serialization / deserialization. This includes old-style
List
andDict
support, along with many other types, nested types, and custom type support as well. This also includes fallbacks for when custom types are not resolvable or available, and has been tested to serialize and deserialize mistral types as well (which use many strange and custom types in their classes).This has been tested to work locally, and with Studio for both data display and image previews, even with custom or third-party types, including mistral types. Note that this was tested before many of the current breaking changes for Studio were introduced, and I will test this again in Studio once those are resolved. (The Studio test failures appear unrelated to this PR, and are also failing on
main
.) This is part of https://github.com/iterative/studio/issues/10601 and fixes #98