[FR] Transform Chaining, Lazy Mapping #6012
You can use an `IterableDataset` for lazy `map` support.
I read about IterableDataset, and it seems to have lazy mapping. But I can't figure out how to convert an IterableDataset into a normal one when needed.
You must cache an `IterableDataset` to convert it into a regular `Dataset`, e.g. with `Dataset.from_generator`:

```python
from functools import partial

from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    yield from iterable_ds

ds = Dataset.from_generator(
    partial(gen_from_iterable_dataset, iterable_ds),
    features=iterable_ds.features,
)
```
Yes, not supported yet - the solution is to combine the transforms into a single one.
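Combining chained transforms into a single `map` call can look like the following sketch. The `tokenize` and `add_length` transforms are hypothetical examples, not from this thread:

```python
# Hypothetical per-example transforms that would otherwise be
# chained as ds.map(tokenize).map(add_length)
def tokenize(example):
    example["tokens"] = example["text"].split()
    return example

def add_length(example):
    example["n_tokens"] = len(example["tokens"])
    return example

def combined(example):
    # Apply the transforms in order, as the chained .map() calls would,
    # so only a single (cached) map pass over the dataset is needed
    for fn in (tokenize, add_length):
        example = fn(example)
    return example

# ds = ds.map(combined)   # one pass instead of two
print(combined({"text": "lazy mapping saves disk"}))
```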
I wonder if it would be beneficial to have a dedicated method to do that? Maybe a
@mariosasko With these complex mapping functions, what hash will be used to cache this dataset?
The params passed to `map` (including the mapped function) are used to compute the resulting dataset's fingerprint, which is what the cache uses.
Hi, I think this feature would be very useful. I want to concatenate large datasets with heterogeneous columns. I dislike
@mariosasko These features would be handy for large datasets. A typical use case is video datasets: We have millions of videos, each stored in some OSS so they require some custom loading logic.
FYI lazy `map` is available for `IterableDataset`:

```python
ds = load_dataset(..., streaming=True)
# or
ds = Dataset.from_list(...).to_iterable_dataset()
# or
ds = IterableDataset.from_generator(...)

# Then you can chain many map/filter/shuffle/etc.
ds = ds.map(...).filter(...).map(...)

# The map functions are applied on-the-fly when iterating on the dataset
for example in ds:
    ...
```
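The on-the-fly behavior described above can be modeled in plain Python with generators. This is a sketch of the idea only, not the actual `IterableDataset` implementation:

```python
def lazy_map(iterable, fn):
    # Generator: fn runs per example only when the pipeline is iterated
    return (fn(ex) for ex in iterable)

data = [{"x": 1}, {"x": 2}, {"x": 3}]

# Chain two "map" steps; nothing is computed or duplicated yet
pipeline = lazy_map(data, lambda ex: {"x": ex["x"] * 2})
pipeline = lazy_map(pipeline, lambda ex: {"x": ex["x"] + 1})

# Work happens here, one example at a time
print(list(pipeline))  # [{'x': 3}, {'x': 5}, {'x': 7}]
```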
Feature request
Currently, a `map` call processes and duplicates the whole dataset, which takes both time and disk space. The solution is to allow lazy mapping, which is essentially a saved chain of transforms that are applied on the fly whenever a slice of the dataset is requested.
The API should look like `map`, as `set_transform` changes the current dataset while `map` returns another dataset.
Motivation
Lazy processing allows lower disk usage and faster experimentation.
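The `set_transform`-vs-`map` distinction described in the request can be sketched in plain Python. This is a toy model for illustration, not the real `datasets` implementation:

```python
class ToyDataset:
    """Toy model: set_transform mutates in place and is lazy;
    map returns a new, eagerly processed dataset."""

    def __init__(self, rows, transform=None):
        self.rows = rows
        self.transform = transform

    def set_transform(self, fn):
        # Changes the current dataset; fn is applied on access
        self.transform = fn

    def map(self, fn):
        # Returns another dataset; fn is applied eagerly here
        return ToyDataset([fn(dict(r)) for r in self.rows])

    def __getitem__(self, i):
        row = dict(self.rows[i])
        return self.transform(row) if self.transform else row

ds = ToyDataset([{"x": 1}])
ds2 = ds.map(lambda r: {"x": r["x"] + 10})     # new dataset, ds untouched
ds.set_transform(lambda r: {"x": r["x"] * 2})  # ds now transforms lazily on access
print(ds[0], ds2[0])  # {'x': 2} {'x': 11}
```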
Your contribution
_