Adding DataChain.export_files(...)
#30
Conversation
@ilongin could you please provide more context? Why is it a priority now?
@dmpetrov The issue was on the Kanban board, you mentioned it's needed for the public release, and it's marked as P1: https://github.com/iterative/dvcx/issues/1721. Am I missing something?
oh, right! Sorry. Completely forgot about this 😅 A link to an issue might help.
I encourage you to implement this in the high-level API, not in the core. If you think the new File API is a blocker, it's better to postpone implementing this.
@dmpetrov The only thing I can think of right now that could be moved to the high-level API is the single-file export function:

```python
def export(self, output: str, force: bool = False):
    ...
```

I feel like this is generic enough for someone to implement exporting from his custom source, for example. WDYT?
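For illustration, here is a minimal sketch of how such a high-level `export()` could look, assuming a `File` object that can `read()` its own bytes. The class and method bodies below are hypothetical, not the PR's actual implementation:

```python
import os


class File:
    """Minimal stand-in for a file model that can read its own bytes (hypothetical)."""

    def __init__(self, name: str, data: bytes):
        self.name = name
        self._data = data

    def read(self) -> bytes:
        return self._data

    def export(self, output: str, force: bool = False) -> None:
        # Write this file's contents under `output`, refusing to
        # overwrite an existing file unless force=True.
        dst = os.path.join(output, self.name)
        if os.path.exists(dst) and not force:
            raise FileExistsError(dst)
        os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
        with open(dst, "wb") as f:
            f.write(self.read())
```

Because it only depends on `read()`, the same method would work for any custom file source that implements that interface.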
Right. Please make sure it does not affect any code in the core. In addition to this, it would be great to have a mapper function that can be called independently, like:

```python
TO_DIR = "my_dir/"
dc.map(res=lambda file: file.export(TO_DIR)).execute()
```
@dmpetrov Btw, regarding the output directory and the paths to specific files, how do you want those to look? We need to have the bucket name in the path to avoid collisions (there can be a file with the same path and name in multiple buckets), but I think we should also have one directory per protocol, because there can be buckets with the same name in multiple clouds as well. For example, this script:

```python
from datachain.lib.dc import C, DataChain

ds = (
    DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .filter(C.name.glob("*cat*"))
    .export_files("cats_output")
)
```

produces these file paths (note that I should get rid of that colon after

Is this ok, or do you want to change something?
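The layout described above can be sketched as a small path-mapping helper: prefix the local path with the protocol and bucket so files from different buckets or different clouds cannot collide. The helper name `export_path` is hypothetical:

```python
from urllib.parse import urlparse


def export_path(source_uri: str, output: str) -> str:
    # Include both the protocol and the bucket in the local path:
    # the same file path can exist in multiple buckets, and the
    # same bucket name can exist in multiple clouds.
    parsed = urlparse(source_uri)
    return "/".join(
        [output, parsed.scheme, parsed.netloc, parsed.path.lstrip("/")]
    )
```

For instance, `export_path("s3://ldb-public/remote/data-lakes/dogs-and-cats/cat.1.jpg", "cats_output")` yields `cats_output/s3/ldb-public/remote/data-lakes/dogs-and-cats/cat.1.jpg`.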
Good question! We have to support multiple strategies:
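The strategies mentioned in this thread (full path, filename only, etag + extension) could each be reduced to a small naming function; the function and strategy names here are hypothetical, for illustration only:

```python
import posixpath


def output_name(strategy: str, path: str, etag: str) -> str:
    # Map a source file path to its name in the output directory,
    # according to the chosen placement strategy.
    name = posixpath.basename(path)
    if strategy == "fullpath":
        return path  # keep the whole source path
    if strategy == "filename":
        return name  # flat directory, original file names
    if strategy == "etag":
        # collision-resistant name: etag plus the original extension
        return etag + posixpath.splitext(name)[1]
    raise ValueError(f"unknown strategy: {strategy}")
```

The "filename" strategy is the simplest but can collide; the "etag" strategy trades readable names for uniqueness.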
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #30      +/-  ##
==========================================
+ Coverage   84.36%   84.44%   +0.08%
==========================================
  Files          94       94
  Lines        9470     9511      +41
  Branches     1872     1883      +11
==========================================
+ Hits         7989     8032      +43
  Misses       1151     1151
+ Partials      330      328       -2
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
strategy is a bit too smart. It can be just
But in case of
I named it this way by mistake. Please take a look at the updated comment.
Looks good! A few small comments are inline.
```diff
@@ -434,6 +434,40 @@ def test_select_except(cloud_test_catalog):
     ]


+@pytest.mark.parametrize(
+    "cloud_type,version_aware",
```
Can we check distinct without `File`? It should not touch other parts if there is no need.
I'd try to distinct on a list of integers...
As we discussed, I would leave this for a separate issue, as there are multiple tests in this file that could be refactored this way.
```diff
@@ -1407,6 +1413,12 @@ def offset(self, offset: int) -> "Self":
         query.steps.append(SQLOffset(offset))
         return query

+    @detach
+    def distinct(self) -> "Self":
```
This signature is not comprehensive. The main use case for distinct() on datasets is removal of duplicate entries; for that, the function should take a signal (or a list of signals) as an argument.
Right! @ilongin could you please implement this as a follow-up issue?
Created #89
Yes, I will create a follow-up issue. It seems like we need something like the PostgreSQL-specific DISTINCT ON, which is not available in SQLite (it has just the "normal" DISTINCT, which returns unique column(s)); we will probably need to implement it with GROUP BY or something else under the hood.
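As an illustration of that idea, PostgreSQL's `DISTINCT ON (name)` can be emulated in SQLite by grouping on the column and picking one representative row per group. A sketch with an in-memory table (table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(1, "cat.jpg"), (2, "cat.jpg"), (3, "dog.jpg")],
)
# GROUP BY keeps one row per name; MIN(id) picks the representative,
# similar to DISTINCT ON (name) ... ORDER BY name, id in PostgreSQL.
rows = conn.execute(
    "SELECT MIN(id), name FROM files GROUP BY name ORDER BY name"
).fetchall()
# rows == [(1, "cat.jpg"), (3, "dog.jpg")]
```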
```python
if self.select(f"{signal}.name").distinct().count() != self.count():
    raise ValueError("Files with the same name found")
```

This statement might not be ideal for two reasons:

- There might be an issue if the original dataset contains duplicates (we cannot guarantee it does not).
- It calls count() twice.

It seems like GROUP BY with a count is the right way to solve this, not distinct.
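A sketch of that GROUP BY approach: one scan finds every duplicated name, and it works even when the dataset already contains duplicate rows (in-memory SQLite, invented table for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("cat.jpg",), ("dog.jpg",), ("cat.jpg",)],
)
# Single pass: any name appearing more than once is a collision.
dupes = conn.execute(
    "SELECT name, COUNT(*) FROM files GROUP BY name HAVING COUNT(*) > 1"
).fetchall()
# dupes == [("cat.jpg", 2)]; a real implementation would raise
# ValueError("Files with the same name found") when dupes is non-empty.
```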
Fixes: https://github.com/iterative/dvcx/issues/1721

Adds:

- `File` method called `export()`, which exports a file to the desired output
- `DataChain` method called `export_files()`, which exports all of its files to the desired location, based on a strategy for how to create file paths (full paths, just filenames, or filenames as etags + extension)
- `distinct()` in `DatasetQuery`

Example: