Rename export_files() to to_storage() #859

Closed
2 tasks
dmpetrov opened this issue Jan 26, 2025 · 1 comment · Fixed by #922
Labels: enhancement (New feature or request)

Comments

@dmpetrov
Member

Description

export_files() should be renamed for consistency with from_storage(), to_parquet(), to_json(), etc. It also needs to support every parameter that from_storage() accepts, such as cloud paths like s3://mybkt/dir1/.

  • Rename to to_storage()
  • Support clouds

Open question: should we go the opposite way and rename from_storage() --> from_files() and export_files() --> to_files(), for consistency with from_parquet() and from_json()? That seems cleaner.

@iterative/datachain WDYT folks?

import io

import tensorflow as tf
from datachain import C, DataChain, File

def predict_dog(file: File) -> float:
    # Load an ImageNet-pretrained ResNet50 and resize the image to its input size.
    model = tf.keras.applications.ResNet50(weights='imagenet')
    img = tf.keras.preprocessing.image.load_img(io.BytesIO(file.read()), target_size=(224, 224))
    pred = model.predict(tf.expand_dims(tf.keras.preprocessing.image.img_to_array(img), axis=0))
    # ImageNet classes 151-268 are dog breeds; sum their probabilities.
    return float(sum(pred[0][151:269]))

(
    DataChain
    .from_storage("s3://mybucket/data/*.jpg")
    .map(dog_score=predict_dog)
    .filter(C("dog_score") > 0.96)
    .to_storage("s3://mybucket/out/dogs")
)

PS:

  • It's OK to break compatibility with export_files() since it was always a workaround
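
If a softer transition were ever preferred over the hard break accepted above, a deprecated alias is one common pattern. The sketch below is only an illustration: `Chain` is a toy stand-in, not the real DataChain class, and the method bodies and the `placement` default are assumptions.

```python
import warnings


class Chain:
    """Toy stand-in for a dataset chain (hypothetical, not the real DataChain)."""

    def to_storage(self, output: str, placement: str = "fullpath") -> str:
        # Placeholder body: a real implementation would copy files to `output`.
        return f"exported to {output} using {placement} placement"

    def export_files(self, output: str, placement: str = "fullpath") -> str:
        # Old name kept as a thin alias that warns and forwards to the new name.
        warnings.warn(
            "export_files() is deprecated; use to_storage() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.to_storage(output, placement=placement)
```

Callers of the old name keep working but see a `DeprecationWarning` that points at their own call site (because of `stacklevel=2`).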
dmpetrov added the enhancement (New feature or request) label on Jan 26, 2025
@skshetry
Member

One thing to keep in mind is that export_files also supports checking out files (for lack of a better term) under different names:

ExportPlacement = Literal["filename", "etag", "fullpath", "checksum"]
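
To make the placement options concrete, here is a rough sketch of how each one might map a source file to a destination path. The helper `destination_path`, its parameters, and the exact naming rules are assumptions for illustration; the library's actual behavior may differ, and `checksum` is deliberately left unimplemented.

```python
import posixpath
from typing import Literal

ExportPlacement = Literal["filename", "etag", "fullpath", "checksum"]


def destination_path(
    output: str, path: str, etag: str, placement: ExportPlacement = "fullpath"
) -> str:
    """Hypothetical helper: build the destination for one exported file."""
    if placement == "filename":
        # Keep only the base name; files from different directories may collide.
        relative = posixpath.basename(path)
    elif placement == "etag":
        # Name the file after its etag, keeping the original extension.
        relative = f"{etag}{posixpath.splitext(path)[1]}"
    elif placement == "fullpath":
        # Preserve the directory structure of the source path.
        relative = path.lstrip("/")
    elif placement == "checksum":
        raise NotImplementedError("checksum placement is not sketched here")
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    return f"{output.rstrip('/')}/{relative}"
```

Under these assumed rules, `filename` gives the flattest layout, while `etag` and `fullpath` avoid collisions between files that share a base name.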

I don't particularly like the to_storage suggestion. In my mental model, from_storage creates a dataset from a storage. This will have a signal called file, which basically has a pointer to the remote file. But it could have other things besides the file signal, or not have one at all.
to_storage would be something that dumps the index/dataset to the storage.

If I ignore different placement strategies, the current function could be best explained as a pull or a fetch. But if we support clouds, I don't know what to call it - maybe copy or copy_to, or just pull (or, is it push?). 😅
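
The two readings of the name can be contrasted with a small local-filesystem sketch. Everything here is hypothetical: `records` stands in for dataset rows whose `file` field points at a source file, `copy_files` is the "pull/copy" reading, and `dump_index` is the "dump the dataset itself" reading.

```python
import json
import shutil
from pathlib import Path


def copy_files(records: list[dict], dest: Path) -> list[Path]:
    """Copy the bytes that each record's "file" pointer refers to."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for record in records:
        source = Path(record["file"])
        target = dest / source.name
        shutil.copyfile(source, target)
        copied.append(target)
    return copied


def dump_index(records: list[dict], dest: Path) -> Path:
    """Write the records (pointers plus other signals), not the file contents."""
    dest.mkdir(parents=True, exist_ok=True)
    index = dest / "index.json"
    index.write_text(json.dumps(records, indent=2))
    return index
```

The first function moves data and drops the extra signals; the second keeps every signal but moves no data, which is the distinction the naming discussion turns on.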

amritghimire self-assigned this on Feb 11, 2025
amritghimire added a commit that referenced this issue Feb 12, 2025
This renames export_files to to_storage and adds support for cloud
destinations as well.

With this change, the following code will work:
```py
from datachain.lib.dc import DataChain

ds = DataChain.from_storage("az://amrit-test-az/image.png")
ds.save("az")

ds.to_storage("gs://amrit-datachain-test/destination", placement="filename")

```

Closes #859
amritghimire added a commit that referenced this issue Feb 13, 2025
amritghimire added a commit that referenced this issue Feb 14, 2025
amritghimire added a commit that referenced this issue Feb 17, 2025