Support all column types in SignalSchema.from_column_types #319

Merged
dreadatour merged 2 commits into main from support-all-column-types
Aug 21, 2024


dreadatour
Contributor

@dreadatour dreadatour commented Aug 19, 2024

With a dataset registered as follows:

from datachain.lib.dc import DataChain 

DataChain.from_storage("s3://dql-50k-laion-files/").save("laion")

Running a basic query on it:

from datachain.lib.dc import C, DataChain 

DataChain(name="laion").filter(C.file__path.glob("*.jpg"))

Fails with the following error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/Users/vlad/work/iterative/datachain/src/datachain/lib/dc.py", line 227, in __init__
    self.signals_schema |= SignalSchema.from_column_types(self.column_types)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vlad/work/iterative/datachain/src/datachain/lib/signal_schema.py", line 108, in from_column_types
    raise SignalSchemaError(
datachain.lib.signal_schema.SignalSchemaError: signal schema cannot be obtained for column 'sys__rand': unsupported type 'None'

This happens because the sys__rand column type is UInt64 (at least in SaaS), and this type does not exist in DATACHAIN_TO_TYPE, which leads to an error here: https://github.com/iterative/datachain/blob/main/src/datachain/lib/signal_schema.py#L107-L111
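
To illustrate the failure mode, here is a minimal sketch of an exact-match type lookup and one possible way to make it tolerant of unmapped types. Everything in it is hypothetical: the classes Int64, UInt64 and String, the TYPE_MAP dictionary, SchemaError, and both functions are stand-ins written for this example, not datachain's real code, and the base-class fallback is only one possible approach, not necessarily what this PR does.

```python
# Hypothetical stand-ins for column type classes; NOT datachain's real classes.
class Int64: ...
class UInt64(Int64): ...  # assumed to subclass Int64 purely for illustration
class String: ...

# Hypothetical stand-in for a mapping such as DATACHAIN_TO_TYPE.
TYPE_MAP = {Int64: int, String: str}


class SchemaError(Exception):
    """Raised when a column type has no Python equivalent."""


def strict_schema(column_types):
    """Exact-match lookup: any type missing from TYPE_MAP raises."""
    schema = {}
    for name, col_type in column_types.items():
        py_type = TYPE_MAP.get(col_type)
        if py_type is None:
            raise SchemaError(
                f"column {name!r}: unsupported type {col_type.__name__!r}"
            )
        schema[name] = py_type
    return schema


def lenient_schema(column_types):
    """Fall back to the nearest mapped base class by walking the MRO."""
    schema = {}
    for name, col_type in column_types.items():
        py_type = next(
            (TYPE_MAP[base] for base in col_type.__mro__ if base in TYPE_MAP),
            None,
        )
        if py_type is None:
            raise SchemaError(
                f"column {name!r}: unsupported type {col_type.__name__!r}"
            )
        schema[name] = py_type
    return schema


columns = {"file__path": String, "sys__rand": UInt64}
```

With these stand-ins, strict_schema(columns) raises SchemaError for sys__rand because UInt64 is not a key of the mapping, while lenient_schema(columns) resolves it through its assumed base class Int64.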

@dreadatour dreadatour requested review from dmpetrov and a team August 19, 2024 17:42
@dreadatour dreadatour self-assigned this Aug 19, 2024
@dreadatour dreadatour added the bug Something isn't working label Aug 19, 2024

codecov bot commented Aug 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.00%. Comparing base (06fdd8c) to head (5a9f335).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #319      +/-   ##
==========================================
+ Coverage   86.94%   87.00%   +0.05%     
==========================================
  Files          90       90              
  Lines        9898     9901       +3     
  Branches     1995     1997       +2     
==========================================
+ Hits         8606     8614       +8     
+ Misses        944      939       -5     
  Partials      348      348              
Flag Coverage Δ
datachain 86.93% <100.00%> (+0.05%) ⬆️


Member

@mattseddon mattseddon left a comment

is there a regression test somewhere?

@dreadatour dreadatour force-pushed the support-all-column-types branch from c5bbb39 to 81634c2 Compare August 20, 2024 02:30

cloudflare-workers-and-pages bot commented Aug 20, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 5a9f335
Status: ✅  Deploy successful!
Preview URL: https://6f207fc6.datachain-documentation.pages.dev
Branch Preview URL: https://support-all-column-types.datachain-documentation.pages.dev

Contributor

@ilongin ilongin left a comment

LGTM

@dreadatour dreadatour force-pushed the support-all-column-types branch from 81634c2 to bb6ff4d Compare August 20, 2024 10:09
@dreadatour
Contributor Author

is there a regression test somewhere?

I wasn't sure about the original approach. I've updated the PR with better code and added a test. Please take a look 🙏

Contributor

@dtulga dtulga left a comment

LGTM, thanks!

@dreadatour dreadatour force-pushed the support-all-column-types branch from d004430 to 5a9f335 Compare August 21, 2024 03:11
@dreadatour dreadatour merged commit 77947c8 into main Aug 21, 2024
38 checks passed
@dreadatour dreadatour deleted the support-all-column-types branch August 21, 2024 03:37
Labels
bug Something isn't working
5 participants