Fix for instantiation of pulled dataset #573
Conversation
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #573      +/-   ##
==========================================
+ Coverage   87.83%   87.87%   +0.04%
==========================================
  Files         100      100
  Lines        9993    10007      +14
  Branches     1356     1357       +1
==========================================
+ Hits         8777     8794      +17
+ Misses        873      871       -2
+ Partials      343      342       -1

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Thanks @ilongin! Can we add a test for this?
Btw, do we have any docs? What is the command that was broken?
@@ -1350,6 +1355,13 @@ def _instantiate_dataset():
        # we will create new one if it doesn't exist
        pass

    if dataset and version and dataset.has_version(version):
        """No need to communicate with Studio at all"""
[Q] Is there some validation that happens to confirm that dataset versions are "the same" between local and remote?
E.g. my_dataset@v1 on remote is full of PDFs and my_dataset@v1 locally is images.
Good question. No, there is no such thing ATM. Currently it's up to the user to make sure they don't "shadow" the remote dataset. To add this kind of validation we would need an equality function between datasets (I will create an issue for this), and my first feeling is that it's not that trivial to implement (maybe I'm wrong).
Actually, it seems like a simple UUID comparison should be enough for this, which is trivial.
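
A minimal sketch of what that could look like, assuming the local and remote dataset version records both carry a uuid field (the field name and the version objects here are assumptions for illustration, not the actual datachain schema):

def is_same_version(local_version, remote_version) -> bool:
    # Hypothetical validation: treat two dataset versions as "the same"
    # only if their UUIDs match, so a local my_dataset@v1 of images cannot
    # silently shadow a remote my_dataset@v1 of PDFs.
    local_uuid = getattr(local_version, "uuid", None)
    remote_uuid = getattr(remote_version, "uuid", None)
    return local_uuid is not None and local_uuid == remote_uuid

With a check like this, the local short-circuit above could be skipped (and the dataset re-pulled) whenever the UUIDs disagree.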
Deploying datachain-documentation with Cloudflare Pages

Latest commit: 24f28a0
Status: ✅ Deploy successful!
Preview URL: https://64b5c9ed.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-566-fix-dataset-pull.datachain-documentation.pages.dev
Added a test.
The command that was broken was the dataset pull command. So currently, if a user wants to pull a dataset and instantiate it locally, it can be done in two ways:
Sounds reasonable to me. Or do a separate command to instantiate it?
Adds a fix for instantiating datasets pulled from Studio.
In addition, it adds a small improvement: avoid going to Studio at all if a dataset with the specified version is already present locally.