-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Complex Type Support to Signal Schema #422
Merged
Merged
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should investigate how others do it (eg:
pydantic
).This custom parsing looks very complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can use
eval
ortyping._eval_type()
/typing.get_type_hints()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any pydantic-based string to type conversion, but please do mention if that is available somewhere. I did think of using
eval
but I'm not sure of the security implications / possibility for other problems from using theeval
function. Also, this section of the code mostly just checks for syntax errors, as seen in the testtest_resolve_types_errors
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these trusted data (coming from
_custom_types
)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My two cents: https://peps.python.org/pep-0563/#resolving-type-hints-at-runtime
I’m sure you’re already aware of this PEP 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also depends on where do we run this. If we run this only in DataChain (CLI or Studio workers) — it is absolutely OK to use
eval
, but if we are going to run this inside Studio backend, we should be very careful, because_custom_types
can be easily updated by user with malformed data.I am OK with this code, it is not covering all possible scenarios and errors might happens in case of bad data, but I think it works just fine with data generated by DataChain and I think we should not overcomplicate it.
Other option might be using regexps btw 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's okay to
eval
even on Studio (they run inside a cluster). They are limited to types. (typing.get_type_hints()
does exactly that).We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).
DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to
File
models.We are implementing our own parser (see
_split_subtypes
) for stringized annotations. And we are adding a lot of complexity in a already complicated module.If we can avoid parsing this ourselves, that is my whole point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I mean in Studio backend, not DataChain compute cluster. And that's my only concern against
eval
.It is OK to do anything inside the cluster, but one can send malformed data into our Studio backend and it is not OK to run
eval
there. We can not even run checks witheval
in API endpoint because it will be run inside Studio backend, and at the same time we can not trust to checks ran in DataChain cluster.I agree 😥
Yes, this is not a replacement for this parser, this is just one more way to do this, worth mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signal Schema deserialization is run within the Studio backend (not just on DataChain workers), as seen here: https://github.com/iterative/studio/blob/2cca2ff0448ed96e2ac5a892dbb30eea09c4f524/backend/dqlapp/schema/types.py#L278 which is why I did not want to use anything like
eval
(or any functions that calleval
as well) as this could be a security risk within the Studio backend.As well, I initially tried regular expressions, but that seemed like it would be similar levels of complexity and also potentially harder to read / understand.
In addition, if there are any other error cases that I did not cover, feel free to mention and I'll be happy to add any necessary checks or unit tests for those as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dtulga ❤️ For me it looks good! 👍