Adding Complex Type Support to Signal Schema #422

dtulga · 2024-09-10T23:28:00Z

This adds complex type support to Signal Schema, including in model serialization / deserialization. This includes old-style List and Dict support, along with many other types, nested types, and custom type support as well. This also includes fallbacks for when custom types are not resolvable or available, and has been tested to serialize and deserialize mistral types as well (which use many strange and custom types in their classes).

This has been tested to work locally, and with Studio for both data display and image previews, even with custom or third-party types, including mistral types. Note that this was tested before many of the current breaking changes for Studio were introduced, and I will test this again in Studio once those are resolved. (The Studio test failures appear unrelated to this PR, and are also failing on main.) This is part of https://github.com/iterative/studio/issues/10601 and fixes #98

codecov · 2024-09-10T23:36:19Z

Codecov Report

Attention: Patch coverage is 97.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 87.17%. Comparing base (43f5b21) to head (a21c51a).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/lib/signal_schema.py	97.72%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #422      +/-   ##
==========================================
+ Coverage   87.09%   87.17%   +0.08%     
==========================================
  Files          92       92              
  Lines        9928     9975      +47     
  Branches     2032     2053      +21     
==========================================
+ Hits         8647     8696      +49     
+ Misses        927      926       -1     
+ Partials      354      353       -1

Flag	Coverage Δ
datachain	`87.12% <97.77%> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

skshetry · 2024-09-11T01:17:45Z

src/datachain/lib/signal_schema.py

+        bracket_idx = type_name.find("[")
+        subtypes: Optional[tuple[Optional[type], ...]] = None
+        if bracket_idx > -1:
+            if bracket_idx == 0:
+                raise TypeError("Type cannot start with '['")
+            close_bracket_idx = type_name.rfind("]")
+            if close_bracket_idx == -1:
+                raise TypeError("Unclosed square bracket when parsing type")
+            if close_bracket_idx < bracket_idx:
+                raise TypeError("Square brackets are out of order when parsing type")
+            if close_bracket_idx == bracket_idx + 1:
+                raise TypeError("Empty square brackets when parsing type")
+            subtype_names = SignalSchema._split_subtypes(
+                type_name[bracket_idx + 1 : close_bracket_idx]
+            )


I think we should investigate how others do it (eg: pydantic).
This custom parsing looks very complicated.

I wonder if we can use eval or typing._eval_type()/typing.get_type_hints().

I'm not aware of any pydantic-based string to type conversion, but please do mention if that is available somewhere. I did think of using eval but I'm not sure of the security implications / possibility for other problems from using the eval function. Also, this section of the code mostly just checks for syntax errors, as seen in the test test_resolve_types_errors.

Aren't these trusted data (coming from _custom_types)?

My two cents: https://peps.python.org/pep-0563/#resolving-type-hints-at-runtime
I’m sure you’re already aware of this PEP 👀

Aren't these trusted data (coming from _custom_types)?

It is also depends on where do we run this. If we run this only in DataChain (CLI or Studio workers) — it is absolutely OK to use eval, but if we are going to run this inside Studio backend, we should be very careful, because _custom_types can be easily updated by user with malformed data.

I am OK with this code, it is not covering all possible scenarios and errors might happens in case of bad data, but I think it works just fine with data generated by DataChain and I think we should not overcomplicate it.

Other option might be using regexps btw 🤔

It's okay to eval even on Studio (they run inside a cluster). They are limited to types. (typing.get_type_hints() does exactly that).

_custom_types can be easily updated by user with malformed data.

We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).

but I think it works just fine with data generated by DataChain

DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to File models.

We are implementing our own parser (see _split_subtypes) for stringized annotations. And we are adding a lot of complexity in a already complicated module.

Other option might be using regexps btw

If we can avoid parsing this ourselves, that is my whole point.

It's okay to eval even on Studio (they run inside a cluster).

No, I mean in Studio backend, not DataChain compute cluster. And that's my only concern against eval.

We should not allow anyone to do it through our API. But a malicious user with access to cluster can do anything they like. So that's not an issue that we need to solve (hence why we have clusters).

It is OK to do anything inside the cluster, but one can send malformed data into our Studio backend and it is not OK to run eval there. We can not even run checks with eval in API endpoint because it will be run inside Studio backend, and at the same time we can not trust to checks ran in DataChain cluster.

DataChain generates arbitrary models, so it should work with anything. DataChain is no longer limited to File models.

We are implementing our own parser (see _split_subtypes) for stringized annotations. And we are adding a lot of complexity in a already complicated module.

I agree 😥

Other option might be using regexps btw

If we can avoid parsing this ourselves, that is my whole point.

Yes, this is not a replacement for this parser, this is just one more way to do this, worth mentioning.

Signal Schema deserialization is run within the Studio backend (not just on DataChain workers), as seen here: https://github.com/iterative/studio/blob/2cca2ff0448ed96e2ac5a892dbb30eea09c4f524/backend/dqlapp/schema/types.py#L278 which is why I did not want to use anything like eval (or any functions that call eval as well) as this could be a security risk within the Studio backend.

As well, I initially tried regular expressions, but that seemed like it would be similar levels of complexity and also potentially harder to read / understand.

In addition, if there are any other error cases that I did not cover, feel free to mention and I'll be happy to add any necessary checks or unit tests for those as well.

Thank you @dtulga ❤️ For me it looks good! 👍

dreadatour

Looks good to me overall. Great job! 👍

Tested this PR on my localhost, and it solved the bug (see this comment).

Also studio tests are failing 😥 But it looks like it is not related to this PR, because it fails in also main branch (e.g. here).

cloudflare-workers-and-pages · 2024-09-11T22:49:01Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`a21c51a`
Status:	✅ Deploy successful!
Preview URL:	https://f1087568.datachain-documentation.pages.dev
Branch Preview URL:	https://dtulga-signal-schema-complex.datachain-documentation.pages.dev

View logs

This reverts commit d6a29df.

skshetry · 2024-09-12T02:44:03Z

I don't think there's any need for being paranoid (they are just types).

At least, I'd have loved to see other alternatives and research into other tools before getting this merged. Again, parsing type annotations is already complicated, and it's a moving target.
It's also hard to guarantee that this approach supports everything.

Adding Complex Type Support to Signal Schema

289086c

dtulga self-assigned this Sep 10, 2024

dtulga mentioned this pull request Sep 10, 2024

SignalSchemaError when saving a dataset with a list of floats. #98

Closed

dtulga marked this pull request as ready for review September 10, 2024 23:57

dtulga requested a review from a team September 10, 2024 23:58

skshetry reviewed Sep 11, 2024

View reviewed changes

dreadatour approved these changes Sep 11, 2024

View reviewed changes

Merge from main

a21c51a

dtulga merged commit d6a29df into main Sep 12, 2024
38 checks passed

dtulga deleted the dtulga/signal-schema-complex-types branch September 12, 2024 01:46

dtulga added a commit that referenced this pull request Sep 12, 2024

Revert "Adding Complex Type Support to Signal Schema (#422)"

195cfec

This reverts commit d6a29df.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Complex Type Support to Signal Schema #422

Adding Complex Type Support to Signal Schema #422

dtulga commented Sep 10, 2024 •

edited

Loading

codecov bot commented Sep 10, 2024 •

edited

Loading

skshetry Sep 11, 2024 •

edited

Loading

skshetry Sep 11, 2024

dtulga Sep 11, 2024

skshetry Sep 11, 2024

dreadatour Sep 11, 2024

dreadatour Sep 11, 2024

skshetry Sep 11, 2024

dreadatour Sep 11, 2024 •

edited

Loading

dtulga Sep 11, 2024

dreadatour Sep 12, 2024

dreadatour left a comment

cloudflare-workers-and-pages bot commented Sep 11, 2024 •

edited

Loading

skshetry commented Sep 12, 2024

Adding Complex Type Support to Signal Schema #422

Adding Complex Type Support to Signal Schema #422

Conversation

dtulga commented Sep 10, 2024 • edited Loading

codecov bot commented Sep 10, 2024 • edited Loading

Codecov Report

skshetry Sep 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dreadatour Sep 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dreadatour left a comment

Choose a reason for hiding this comment

cloudflare-workers-and-pages bot commented Sep 11, 2024 • edited Loading

Deploying datachain-documentation with Cloudflare Pages

skshetry commented Sep 12, 2024

dtulga commented Sep 10, 2024 •

edited

Loading

codecov bot commented Sep 10, 2024 •

edited

Loading

skshetry Sep 11, 2024 •

edited

Loading

dreadatour Sep 11, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Sep 11, 2024 •

edited

Loading