
Adding DataChain.export_files(...) #30

Merged — 33 commits merged into main on Jul 18, 2024
Conversation

@ilongin (Contributor) commented Jul 13, 2024

Fixes: https://github.com/iterative/dvcx/issues/1721

Adds:

  • A new File method, export(), which exports a file to a desired output location
  • A new DataChain method, export_files(), which exports all of the chain's files to a desired location, building file paths according to a strategy (full paths, just filenames, or etags plus the original extension)
  • A new SQL method, distinct(), in DatasetQuery

Example:

from datachain.lib.dc import C, DataChain

ds = (
    DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .filter(C.name.glob("*cat*"))
    .export_files("cats_output", strategy="filename")
)

# this also works
ds.map(res=lambda file: file.export("cats_output", strategy="filename")).exec()

@ilongin ilongin marked this pull request as draft July 13, 2024 00:35
@dmpetrov (Member)

@ilongin could you please provide more context? Why is it a priority now?

@ilongin (Contributor, author) commented Jul 13, 2024

@ilongin could you please provide more context? Why is it a priority now?

@dmpetrov The issue was on the Kanban board, you mentioned it's needed for the public release, and it's marked as P1: https://github.com/iterative/dvcx/issues/1721

Am I missing something?

@dmpetrov (Member)

oh, right! Sorry. Completely forgot about this 😅 A link to an issue might help.

@dmpetrov (Member)

I'd encourage you to implement this in the high-level API, not in the core.
We will be moving all file operations out of the core - #33

If you think the new File API is a blocker, it's better to postpone implementing this.

@ilongin (Contributor, author) commented Jul 13, 2024

I'd encourage you to implement this in the high-level API, not in the core. We will be moving all file operations out of the core - #33

If you think the new File API is a blocker, it's better to postpone implementing this.

@dmpetrov the only thing I can think of right now that could be moved to the high-level API is the single-file export function.
So in File we would have an export method:

def export(self, output: str, force: bool = False):
    ...

And DataChain.export() (core) would contain the logic of getting those file objects by signal name and calling file.export(...) on each. The core would also take care of running it in parallel (this could be configurable through args as well, btw).

I feel this is generic enough for someone to implement exporting from a custom source, for example.

WDYT?

@dmpetrov (Member)

So in File we would have an export method:

Right. Please make sure it does not affect any code in the core, only in lib. A progress bar is not needed since that has to be handled by the mapper progress bar.

In addition to this, it would be great to have a mapper function that can be called independently, like

TO_DIR = "my_dir/"
dc.map(res=lambda file: file.export(TO_DIR)).execute()

@ilongin (Contributor, author) commented Jul 15, 2024

@dmpetrov btw, regarding the output directory and the paths to specific files, how do you want that to look? We need the bucket name in the path to avoid collisions (there can be a file with the same path and name in multiple buckets), but I think we should also have one directory per protocol, because there can be buckets with the same name in multiple clouds as well?

For example with this script:

from datachain.lib.dc import C, DataChain

ds = (
    DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .filter(C.name.glob("*cat*"))
    .export_files("cats_output")
)

it produces these file paths (note that I should get rid of the colon after s3):

cats_output/s3:/ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg
cats_output/s3:/ldb-public/remote/data-lakes/dogs-and-cats/cat2.jpg
...

Is this OK, or do you want to change something?

cloudflare-workers-and-pages bot commented Jul 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 8fffba0
Status: ✅ Deploy successful!
Preview URL: https://cf76212a.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1721-datachain-expor.datachain-documentation.pages.dev

@dmpetrov (Member) commented Jul 16, 2024

regarding output directory and paths to specific files

Good question! We have to support multiple strategies:

  • "fullpath" - looks good as the default strategy, but please do not include s3 prefixes. It should be cats_output/ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg
    • we can support an optional prefix to shrink the path, like .export_files("./myout", prefix="ldb-public/remote/data-lakes/") to generate ./myout/dogs-and-cats/cat1.jpg
  • "filename" - keep only filenames. You will need to pre-check uniqueness of the names (do we have a distinct function in our API to check that?).
  • "etag" - rename the file to its etag but keep the extension. Collisions are ok.
  • (not a priority for now) "checksum" - same as etag but calculating a checksum

def export_files(dir: str, strategy: Literal["fullpath", "filename", "etag", "checksum"] = "fullpath", prefix: str = "")
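The strategies listed above can be sketched as a small path-mapping helper. The function name build_output_path and its etag argument are hypothetical, chosen only for illustration; the real export_files() may structure this differently:

```python
import posixpath

# Hedged sketch of the per-strategy path rules proposed in this thread.
# `path` is the storage path with the scheme already stripped, e.g.
# "ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg".


def build_output_path(path: str, etag: str,
                      strategy: str = "fullpath", prefix: str = "") -> str:
    """Return the output-relative path for one exported file."""
    if strategy == "fullpath":
        # Optionally shrink the path by removing a known prefix.
        if prefix and path.startswith(prefix):
            return path[len(prefix):].lstrip("/")
        return path
    if strategy == "filename":
        # Caller must have verified that filenames are unique.
        return posixpath.basename(path)
    if strategy == "etag":
        # Rename to the etag but keep the extension; collisions are ok.
        _, ext = posixpath.splitext(path)
        return etag + ext
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, the "fullpath" strategy with prefix="ldb-public/remote/data-lakes/" would map the path above to "dogs-and-cats/cat1.jpg".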

codecov bot commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.44%. Comparing base (0b1da6b) to head (8fffba0).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
+ Coverage   84.36%   84.44%   +0.08%     
==========================================
  Files          94       94              
  Lines        9470     9511      +41     
  Branches     1872     1883      +11     
==========================================
+ Hits         7989     8032      +43     
  Misses       1151     1151              
+ Partials      330      328       -2     
Flag: datachain — coverage 84.38% <100.00%> (+0.08%) ⬆️


@dmpetrov (Member)

strategy

strategy is a bit too smart. It can be just filename=....

@ilongin (Contributor, author) commented Jul 16, 2024

strategy

strategy is a bit too smart. It can be just filename=....

But in the case of the storage_path strategy it's also about creating multiple levels of directories, so we cannot name it just filename=. I would leave strategy.
Maybe we can have filepath: Literal["filename", "etag", "storage_path"] = "storage_path"

@ilongin ilongin marked this pull request as ready for review July 16, 2024 15:27
@dmpetrov (Member)

storage_path

I named it this way by mistake. Please take a look at the updated comment.

@dmpetrov (Member) left a comment

Looks good! A few small comments are inline.

src/datachain/lib/file.py Outdated Show resolved Hide resolved
src/datachain/lib/file.py Outdated Show resolved Hide resolved
src/datachain/lib/file.py Show resolved Hide resolved
src/datachain/query/dataset.py Show resolved Hide resolved
src/datachain/lib/dc.py Outdated Show resolved Hide resolved
@@ -434,6 +434,40 @@ def test_select_except(cloud_test_catalog):
]


@pytest.mark.parametrize(
"cloud_type,version_aware",

Member:

Can we check distinct without File? It should not touch other parts if there is no need

Member:

I'd try to distinct on a list of integers...

Contributor (author):

As we discussed, I would leave this for a separate issue, since there are multiple tests in this file that could be refactored this way.

tests/unit/lib/test_file.py Outdated Show resolved Hide resolved
@ilongin ilongin merged commit 9862211 into main Jul 18, 2024
19 checks passed
@ilongin ilongin deleted the ilongin/1721-datachain-export-files branch July 18, 2024 08:20
@@ -1407,6 +1413,12 @@ def offset(self, offset: int) -> "Self":
query.steps.append(SQLOffset(offset))
return query

@detach
def distinct(self) -> "Self":
Contributor:

This signature is not comprehensive.

The main use case for distinct() on datasets is removal of duplicate entries - for that, the function should take a signal (or a list of signals) as an argument.

Member:

right! @ilongin could you please implement this as a follow-up issue?

Member:

Created #89

Contributor (author):

Yes, I will create a follow-up issue. It seems we need something like PostgreSQL's DISTINCT ON, which is not available in SQLite (it has only the "normal" DISTINCT, returning unique column values), so we will probably need to implement it with GROUP BY or something similar under the hood.
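The GROUP BY workaround mentioned above can be sketched directly in SQLite. The schema and column names here are invented for illustration; the point is only that one row per key, chosen deterministically, can be obtained without DISTINCT ON:

```python
import sqlite3

# Sketch: emulate PostgreSQL's DISTINCT ON (name) in SQLite.
# GROUP BY yields one row per name, and wrapping the other column in
# MIN() makes the chosen representative deterministic.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, etag TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("cat1.jpg", "e1"), ("cat1.jpg", "e2"), ("cat2.jpg", "e3")],
)

# Roughly the spirit of: SELECT DISTINCT ON (name) name, etag ... in PostgreSQL.
rows = conn.execute(
    "SELECT name, MIN(etag) FROM files GROUP BY name ORDER BY name"
).fetchall()
```

In the dataset layer, the same pattern would group by the chosen signal(s) and aggregate the remaining columns.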

@dmpetrov (Member) commented Jul 19, 2024

if self.select(f"{signal}.name").distinct().count() != self.count():
    raise ValueError("Files with the same name found")

This statement might not be ideal for two reasons:

  1. There might be an issue if the original dataset contains duplicates (we cannot guarantee it doesn't).
  2. It runs count() twice.

A group by with a count seems like the right way to solve this, not distinct.
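The group-by alternative suggested above could look like this sketch: find the duplicate names in a single query, instead of comparing two count() calls, and stay robust to pre-existing duplicates. Table and column names are illustrative, not datachain internals:

```python
import sqlite3

# Single-pass duplicate check for the "filename" export strategy.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("cat1.jpg",), ("cat2.jpg",), ("cat1.jpg",)],
)

# Names that appear more than once are exactly the collisions that
# would break a filename-only export.
duplicates = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM files GROUP BY name HAVING COUNT(*) > 1"
    )
]
```

An export_files(..., strategy="filename") implementation could then raise ValueError("Files with the same name found") whenever duplicates is non-empty, at the cost of a single query.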

@dmpetrov dmpetrov mentioned this pull request Jul 18, 2024
2 tasks