
Adding DataChain.export_files(...) #30

Merged — 33 commits merged into main on Jul 18, 2024
Conversation

@ilongin (Contributor) commented Jul 13, 2024

Fixes: https://github.com/iterative/dvcx/issues/1721

Adds:

  • A new File method, export(), which exports a file to a desired output location
  • A new DataChain method, export_files(), which exports all of the chain's files to a desired location, building file paths according to a strategy (full paths, just filenames, or etags plus the original extension)
  • A new SQL method, distinct(), in DatasetQuery

Example:

from datachain.lib.dc import C, DataChain

ds = (
    DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .filter(C.name.glob("*cat*"))
    .export_files("cats_output", strategy="filename")
)

# this also works
ds.map(res=lambda file: file.export("cats_output", strategy="filename")).exec()

@ilongin ilongin marked this pull request as draft July 13, 2024 00:35
@dmpetrov (Member)

@ilongin could you please provide more context? Why is it a priority now?

@ilongin (Contributor, author) commented Jul 13, 2024

@ilongin could you please provide more context? Why is it a priority now?

@dmpetrov The issue was on the Kanban board, you mentioned it's needed for the public release, and it's marked as P1: https://github.com/iterative/dvcx/issues/1721

Am I missing something?

@dmpetrov (Member)

oh, right! Sorry. Completely forgot about this 😅 A link to an issue might help.

@dmpetrov (Member)

I'd encourage you to implement this in the high-level API, not in the core.
We will be moving all file operations out of the core - #33

If you think the new File API is a blocker, it's better to postpone implementing this.

@ilongin (Contributor, author) commented Jul 13, 2024

I'd encourage you to implement this in the high-level API, not in the core. We will be moving all file operations out of the core - #33

If you think the new File API is a blocker, it's better to postpone implementing this.

@dmpetrov the only thing I can think of right now that could be moved to the high-level API is the single-file export function.
So in File we would have an export method:

def export(self, output: str, force: bool = False):
    ...

And DataChain.export() (core) would contain the logic of getting those file objects by signal name and calling file.export(...) on each. The core would also take care of running it in parallel (this could be configurable through args as well, btw).

I feel this is generic enough for someone to implement exporting from a custom source, for example.

WDYT?

@dmpetrov (Member)

So in File we would have an export method:

Right. Please make sure it does not affect any code in the core, only in lib. A progress bar is not needed since that has to be handled by the mapper progress bar.

In addition to this, it would be great to have a mapper function that can be called independently, like

TO_DIR = "my_dir/"
dc.map(res=lambda file: file.export(TO_DIR)).execute()

@ilongin (Contributor, author) commented Jul 15, 2024

@dmpetrov btw, regarding the output directory and the paths to specific files, how do you want that to look? We need the bucket name in the path to avoid collisions (there can be a file with the same path and name in multiple buckets), but I think we should also have one directory per protocol, because there can be buckets with the same name in multiple clouds as well?

For example with this script:

from datachain.lib.dc import C, DataChain

ds = (
    DataChain.from_storage("s3://ldb-public/remote/data-lakes/dogs-and-cats/", anon=True)
    .filter(C.name.glob("*cat*"))
    .export_files("cats_output")
)

it produces these file paths (note that I should get rid of the colon after s3):

cats_output/s3:/ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg
cats_output/s3:/ldb-public/remote/data-lakes/dogs-and-cats/cat2.jpg
...

Is this OK, or do you want to change something?

cloudflare-workers-and-pages bot commented Jul 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 8fffba0
Status: ✅ Deploy successful!
Preview URL: https://cf76212a.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1721-datachain-expor.datachain-documentation.pages.dev

@dmpetrov (Member) commented Jul 16, 2024

regarding output directory and paths to specific files

Good question! We have to support multiple strategies:

  • "fullpath" - looks good as the default strategy, but please do not include s3 prefixes. It should be cats_output/ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg
    • we can support an optional prefix to shrink the path, like .export_files("./myout", prefix="ldb-public/remote/data-lakes/") to generate ./myout/dogs-and-cats/cat1.jpg
  • "filename" - keep only filenames. You will need to pre-check uniqueness of the names (do we have a distinct function in our API to check that?).
  • "etag" - rename the file to its etag but keep the extension. Collisions are ok.
  • (not a priority for now) "checksum" - same as etag but calculating a checksum

def export_files(dir: str, strategy: Literal["fullpath", "filename", "etag", "checksum"] = "fullpath", prefix: str = "")
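The strategies listed above can be sketched as a small path-mapping helper. The function name build_output_path and its etag argument are hypothetical, chosen only for illustration; the real export_files() may structure this differently:

```python
import posixpath

# Hedged sketch of the per-strategy path rules proposed in this thread.
# `path` is the storage path with the scheme already stripped, e.g.
# "ldb-public/remote/data-lakes/dogs-and-cats/cat1.jpg".


def build_output_path(path: str, etag: str,
                      strategy: str = "fullpath", prefix: str = "") -> str:
    """Return the output-relative path for one exported file."""
    if strategy == "fullpath":
        # Optionally shrink the path by removing a known prefix.
        if prefix and path.startswith(prefix):
            return path[len(prefix):].lstrip("/")
        return path
    if strategy == "filename":
        # Caller must have verified that filenames are unique.
        return posixpath.basename(path)
    if strategy == "etag":
        # Rename to the etag but keep the extension; collisions are ok.
        _, ext = posixpath.splitext(path)
        return etag + ext
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, the "fullpath" strategy with prefix="ldb-public/remote/data-lakes/" would map the path above to "dogs-and-cats/cat1.jpg".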

codecov bot commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.44%. Comparing base (0b1da6b) to head (8fffba0).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
+ Coverage   84.36%   84.44%   +0.08%     
==========================================
  Files          94       94              
  Lines        9470     9511      +41     
  Branches     1872     1883      +11     
==========================================
+ Hits         7989     8032      +43     
  Misses       1151     1151              
+ Partials      330      328       -2     
Flag: datachain — coverage 84.38% <100.00%> (+0.08%) ⬆️


@dmpetrov (Member)

strategy

strategy is a bit too smart. It can be just filename=....

@ilongin (Contributor, author) commented Jul 16, 2024

strategy

strategy is a bit too smart. It can be just filename=....

But in the case of the storage_path strategy it's also about creating multiple levels of directories, so we cannot name it just filename=. I would leave strategy.
Maybe we can have filepath: Literal["filename", "etag", "storage_path"] = "storage_path"

@ilongin ilongin marked this pull request as ready for review July 16, 2024 15:27
@dmpetrov (Member)

storage_path

I named it this way by mistake. Please take a look at the updated comment.

@dmpetrov (Member) left a comment

Looks good! A few small comments are inline.

src/datachain/lib/file.py Outdated Show resolved Hide resolved
src/datachain/lib/file.py Outdated Show resolved Hide resolved
src/datachain/lib/file.py Show resolved Hide resolved
src/datachain/query/dataset.py Show resolved Hide resolved
src/datachain/lib/dc.py Outdated Show resolved Hide resolved
@@ -434,6 +434,40 @@ def test_select_except(cloud_test_catalog):
]


@pytest.mark.parametrize(
"cloud_type,version_aware",

Member:

Can we check distinct without File? It should not touch other parts if there is no need

Member:

I'd try to distinct on a list of integers...

Contributor (author):

As we discussed, I would leave this for a separate issue, since there are multiple tests in this file that could be refactored this way.

tests/unit/lib/test_file.py Outdated Show resolved Hide resolved
@ilongin ilongin merged commit 9862211 into main Jul 18, 2024
19 checks passed
@ilongin ilongin deleted the ilongin/1721-datachain-export-files branch July 18, 2024 08:20
@@ -1407,6 +1413,12 @@ def offset(self, offset: int) -> "Self":
query.steps.append(SQLOffset(offset))
return query

@detach
def distinct(self) -> "Self":
Contributor:

This signature is not comprehensive.

The main use case for distinct() on datasets is removal of duplicate entries - for that, the function should take a signal (or a list of signals) as an argument.

Member:

right! @ilongin could you please implement this as a follow-up issue?

Member:

Created #89

Contributor (author):

Yes, I will create a follow-up issue. It seems we need something like PostgreSQL's DISTINCT ON, which is not available in SQLite (it has only the "normal" DISTINCT, returning unique column values), so we will probably need to implement it with GROUP BY or something similar under the hood.
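The GROUP BY workaround mentioned above can be sketched directly in SQLite. The schema and column names here are invented for illustration; the point is only that one row per key, chosen deterministically, can be obtained without DISTINCT ON:

```python
import sqlite3

# Sketch: emulate PostgreSQL's DISTINCT ON (name) in SQLite.
# GROUP BY yields one row per name, and wrapping the other column in
# MIN() makes the chosen representative deterministic.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, etag TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("cat1.jpg", "e1"), ("cat1.jpg", "e2"), ("cat2.jpg", "e3")],
)

# Roughly the spirit of: SELECT DISTINCT ON (name) name, etag ... in PostgreSQL.
rows = conn.execute(
    "SELECT name, MIN(etag) FROM files GROUP BY name ORDER BY name"
).fetchall()
```

In the dataset layer, the same pattern would group by the chosen signal(s) and aggregate the remaining columns.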

@dmpetrov (Member) commented Jul 19, 2024

if self.select(f"{signal}.name").distinct().count() != self.count():
    raise ValueError("Files with the same name found")

This statement might not be ideal for two reasons:

  1. There might be an issue if the original dataset contains duplicates (we cannot guarantee it doesn't).
  2. It runs count() twice.

A group by with a count seems like the right way to solve this, not distinct.
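The group-by alternative suggested above could look like this sketch: find the duplicate names in a single query, instead of comparing two count() calls, and stay robust to pre-existing duplicates. Table and column names are illustrative, not datachain internals:

```python
import sqlite3

# Single-pass duplicate check for the "filename" export strategy.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("cat1.jpg",), ("cat2.jpg",), ("cat1.jpg",)],
)

# Names that appear more than once are exactly the collisions that
# would break a filename-only export.
duplicates = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM files GROUP BY name HAVING COUNT(*) > 1"
    )
]
```

An export_files(..., strategy="filename") implementation could then raise ValueError("Files with the same name found") whenever duplicates is non-empty, at the cost of a single query.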

@dmpetrov dmpetrov mentioned this pull request Jul 18, 2024
2 tasks