
[FR] Transform Chaining, Lazy Mapping #6012

Open
NightMachinery opened this issue Jul 9, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@NightMachinery
Contributor

Feature request

Currently, a map call processes and duplicates the whole dataset, which costs both time and disk space.

The solution is to allow lazy mapping, which is essentially a saved chain of transforms that are applied on the fly whenever a slice of the dataset is requested.

The API should look like map, since set_transform changes the current dataset in place while map returns a new dataset.
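
A hypothetical sketch of what the requested API could look like (lazy_map does not exist in datasets; the name and the map-like semantics here are purely illustrative):

from datasets import load_dataset

ds = load_dataset("imdb", split="train")

# Hypothetical: each call records a transform instead of materializing it,
# so nothing is processed or written to disk yet.
ds = ds.lazy_map(lambda batch: {"text_len": [len(t) for t in batch["text"]]}, batched=True)
ds = ds.lazy_map(lambda batch: {"is_long": [n > 1000 for n in batch["text_len"]]}, batched=True)

# The chained transforms only run when a slice is actually requested.
print(ds[:8])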

Motivation

Lazy processing allows lower disk usage and faster experimentation.

Your contribution

_

@mariosasko
Collaborator

You can use with_transform to get a new dataset object.

Support for lazy map has already been discussed here a little bit. Personally, I'm not a fan, as this would make map even more complex.
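
For reference, a minimal sketch of the with_transform pattern (the example data and column names are made up):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello world", "lazy mapping"]})

def add_length(batch):
    # batch is a dict of lists covering the rows being accessed
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

# Returns a new Dataset object; the transform runs on the fly when rows are
# accessed and nothing is written to the cache.
lazy_ds = ds.with_transform(add_length)
print(lazy_ds[0])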

@NightMachinery
Contributor Author

You can use with_transform to get a new dataset object.

Support for lazy map has already been discussed here a little bit. Personally, I'm not a fan, as this would make map even more complex.

I read about IterableDataset, and it seems to have lazy mapping. But I can't figure out how to convert an IterableDataset into a normal one when needed.

with_transform still does not chain AFAIU.

@mariosasko
Collaborator

I read about IterableDataset, and it seems to have lazy mapping. But I can't figure out how to convert an IterableDataset into a normal one when needed.

You must cache an IterableDataset to disk to load it as a Dataset. One way to do this is with Dataset.from_generator:

from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)

with_transform still does not chain AFAIU.

Yes, not supported yet - the solution is to combine the transforms into a single one.
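
For example, a minimal sketch of folding two transforms into a single callable before passing it to with_transform (the transforms and column names are illustrative):

def lowercase(batch):
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

def add_length(batch):
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

def combined(batch):
    # apply the transforms in order inside a single callable
    return add_length(lowercase(batch))

lazy_ds = ds.with_transform(combined)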

@lhoestq
Member

lhoestq commented Jul 11, 2023

I wonder if it would be beneficial to have a dedicated method to do that? Maybe a .save_to_disk() so that the user can reload the resulting dataset later?

@NightMachinery
Contributor Author

from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)

@mariosasko With these complex mapping functions, what hash will be used to cache this dataset?

@mariosasko
Collaborator

The params passed to Dataset.from_generator will be used to compute the hash (partial encapsulates the iterable_ds value, so changing it will also change the hash).
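
A rough way to check this, assuming two distinct IterableDataset objects iterable_ds_a and iterable_ds_b and the gen_from_iterable_dataset helper from above:

from functools import partial
from datasets.fingerprint import Hasher

# The partial object captures the iterable dataset, so the computed hash
# (and therefore the cache entry used by Dataset.from_generator) differs
# between the two generators.
print(Hasher.hash(partial(gen_from_iterable_dataset, iterable_ds_a)))
print(Hasher.hash(partial(gen_from_iterable_dataset, iterable_ds_b)))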

@kkoutini
Contributor

Hi, I think this feature would be very useful. I want to concatenate large datasets with heterogeneous columns. I dislike map since I don't want multiple copies of those datasets locally. I tried using set_transform on each dataset to convert it to a standard features format, but datasets.concatenate_datasets ignores the updated format of the datasets. A workaround is to use torch.utils.data.ConcatDataset. Is there a neat way to do this with HF datasets?
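
A rough sketch of that ConcatDataset workaround (the datasets, columns, and transforms below are made up):

from datasets import Dataset
from torch.utils.data import ConcatDataset, DataLoader

ds_a = Dataset.from_dict({"text_a": ["foo", "bar"]})
ds_b = Dataset.from_dict({"text_b": ["baz"]})

# Convert each dataset to a common feature format on the fly
ds_a.set_transform(lambda batch: {"text": batch["text_a"]})
ds_b.set_transform(lambda batch: {"text": batch["text_b"]})

# datasets.concatenate_datasets ignores these transforms, so wrap the datasets
# with torch's ConcatDataset instead
combined = ConcatDataset([ds_a, ds_b])
loader = DataLoader(combined, batch_size=2)

for batch in loader:
    print(batch)  # {'text': ['foo', 'bar']}, then {'text': ['baz']}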

@luowyang

luowyang commented Jan 18, 2025

@mariosasko These features would be handy for large datasets. A typical use case is video datasets: we have millions of videos, each stored in some OSS, so they require custom loading logic.

  1. Due to memory limits, loading all the videos into memory up front is infeasible, but with lazy mapping we can postpone video loading until the videos are needed.
  2. With chained transforms, we can let users specify their own video preprocessing logic while keeping the loading logic the same.

@lhoestq
Copy link
Member

lhoestq commented Jan 20, 2025

FYI, lazy map is available for IterableDataset (map is applied on-the-fly when iterating over the dataset):

ds = load_dataset(..., streaming=True)
# or
ds = Dataset.from_list(...).to_iterable_dataset()
# or
ds = IterableDataset.from_generator(...)

# Then you can chain many map/filter/shuffle/etc.
ds = ds.map(...).filter(...).map(...)

# The map functions are applied on-the-fly when iterating on the dataset
for example in ds:
    ...
