
[FR] Transform Chaining, Lazy Mapping #6012

Open
NightMachinery opened this issue Jul 9, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@NightMachinery
Contributor

Feature request

Currently, a map call processes and duplicates the whole dataset, which costs both time and disk space.

The solution is to allow lazy mapping, which is essentially a saved chain of transforms that are applied on the fly whenever a slice of the dataset is requested.

The API should look like map, since set_transform changes the current dataset in place while map returns a new dataset.
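
A hypothetical sketch of what the requested API could look like (lazy_map does not exist in datasets; the name and the map-like semantics here are purely illustrative):

from datasets import load_dataset

ds = load_dataset("imdb", split="train")

# Hypothetical: each call records a transform instead of materializing it,
# so nothing is processed or written to disk yet.
ds = ds.lazy_map(lambda batch: {"text_len": [len(t) for t in batch["text"]]}, batched=True)
ds = ds.lazy_map(lambda batch: {"is_long": [n > 1000 for n in batch["text_len"]]}, batched=True)

# The chained transforms only run when a slice is actually requested.
print(ds[:8])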

Motivation

Lazy processing allows lower disk usage and faster experimentation.

Your contribution

_

@mariosasko
Collaborator

You can use with_transform to get a new dataset object.

Support for lazy map has already been discussed here a little bit. Personally, I'm not a fan, as this would make map even more complex.
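
For reference, a minimal sketch of the with_transform pattern (the example data and column names are made up):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello world", "lazy mapping"]})

def add_length(batch):
    # batch is a dict of lists covering the rows being accessed
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

# Returns a new Dataset object; the transform runs on the fly when rows are
# accessed and nothing is written to the cache.
lazy_ds = ds.with_transform(add_length)
print(lazy_ds[0])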

@NightMachinery
Contributor Author

You can use with_transform to get a new dataset object.

Support for lazy map has already been discussed here a little bit. Personally, I'm not a fan, as this would make map even more complex.

I read about IterableDataset, and it seems to have lazy mapping. But I can't figure out how to convert an IterableDataset into a normal one when needed.

with_transform still does not chain AFAIU.

@mariosasko
Collaborator

I read about IterableDataset, and it seems to have lazy mapping. But I can't figure out how to convert an IterableDataset into a normal one when needed.

You must cache an IterableDataset to disk to load it as a Dataset. One way to do this is with Dataset.from_generator:

from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)

with_transform still does not chain AFAIU.

Yes, not supported yet - the solution is to combine the transforms into a single one.
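
For example, a minimal sketch of folding two transforms into a single callable before passing it to with_transform (the transforms and column names are illustrative):

def lowercase(batch):
    batch["text"] = [t.lower() for t in batch["text"]]
    return batch

def add_length(batch):
    batch["length"] = [len(t) for t in batch["text"]]
    return batch

def combined(batch):
    # apply the transforms in order inside a single callable
    return add_length(lowercase(batch))

lazy_ds = ds.with_transform(combined)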

@lhoestq
Member

lhoestq commented Jul 11, 2023

I wonder if it would be beneficial to have a dedicated method to do that? Maybe a .save_to_disk() so that the user can reload the resulting dataset later?

@NightMachinery
Contributor Author

from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

ds = Dataset.from_generator(partial(gen_from_iterable_dataset, iterable_ds), features=iterable_ds.features)

@mariosasko With these complex mapping functions, what hash will be used to cache this dataset?

@mariosasko
Collaborator

The params passed to Dataset.from_generator will be used to compute the hash (partial encapsulates the iterable_ds value, so changing it will also change the hash).
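
A rough way to check this, assuming two distinct IterableDataset objects iterable_ds_a and iterable_ds_b and the gen_from_iterable_dataset helper from above:

from functools import partial
from datasets.fingerprint import Hasher

# The partial object captures the iterable dataset, so the computed hash
# (and therefore the cache entry used by Dataset.from_generator) differs
# between the two generators.
print(Hasher.hash(partial(gen_from_iterable_dataset, iterable_ds_a)))
print(Hasher.hash(partial(gen_from_iterable_dataset, iterable_ds_b)))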

@kkoutini
Contributor

Hi, I think this feature would be very useful. I want to concatenate large datasets with heterogeneous columns. I dislike map since I don't want multiple copies of those datasets locally. I tried using set_transform on each dataset to convert it to a standard features format, but datasets.concatenate_datasets ignores the updated format of the datasets. A workaround is to use torch.utils.data.ConcatDataset. Is there a neat way to do this with HF datasets?
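
A rough sketch of that ConcatDataset workaround (the datasets, columns, and transforms below are made up):

from datasets import Dataset
from torch.utils.data import ConcatDataset, DataLoader

ds_a = Dataset.from_dict({"text_a": ["foo", "bar"]})
ds_b = Dataset.from_dict({"text_b": ["baz"]})

# Convert each dataset to a common feature format on the fly
ds_a.set_transform(lambda batch: {"text": batch["text_a"]})
ds_b.set_transform(lambda batch: {"text": batch["text_b"]})

# datasets.concatenate_datasets ignores these transforms, so wrap the datasets
# with torch's ConcatDataset instead
combined = ConcatDataset([ds_a, ds_b])
loader = DataLoader(combined, batch_size=2)

for batch in loader:
    print(batch)  # {'text': ['foo', 'bar']}, then {'text': ['baz']}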

@luowyang

luowyang commented Jan 18, 2025

@mariosasko These features would be handy for large datasets. A typical use case is video datasets: we have millions of videos, each stored in some OSS, so they require custom loading logic.

  1. Due to memory limits, loading all the videos into memory up front is infeasible, but with lazy mapping we can postpone video loading until the videos are needed.
  2. With chained transforms, we can let users specify their own video preprocessing logic while keeping the loading logic the same.

@lhoestq
Copy link
Member

lhoestq commented Jan 20, 2025

FYI, lazy map is available for IterableDataset (map is applied on-the-fly when iterating over the dataset):

ds = load_dataset(..., streaming=True)
# or
ds = Dataset.from_list(...).to_iterable_dataset()
# or
ds = IterableDataset.from_generator(...)

# Then you can chain many map/filter/shuffle/etc.
ds = ds.map(...).filter(...).map(...)

# The map functions are applied on-the-fly when iterating on the dataset
for example in ds:
    ...
