None batched with_transform, set_transform #3385

Open
cccntu opened this issue Dec 6, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@cccntu (Contributor) commented Dec 6, 2021

Is your feature request related to a problem? Please describe.

A torch.utils.data.Dataset.__getitem__ operates on a single example.
But 🤗 Datasets' with_transform doesn't seem to allow a non-batched transform.

Describe the solution you'd like

Have a batched=True argument in Dataset.with_transform, so that batched=False allows a per-example transform.

Describe alternatives you've considered

  • Convert a non-batched transform function to a batched one myself (a sketch of this wrapping follows this list).
  • Wrap a 🤗 Dataset with a torch Dataset, and add a __getitem__. 🙄
  • Have lazy=False in Dataset.map, and return a LazyDataset if lazy=True. This way the same map interface can be used, and existing code can be updated with a one-argument change.
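
For reference, a minimal sketch of the first alternative, assuming a per-example transform that maps a dict to a dict (batchify is a hypothetical helper, not part of the library):

import datasets

# Hypothetical helper: turn a per-example transform (dict -> dict) into the
# batched form that with_transform expects (dict of column -> list of values).
def batchify(fn):
    def batched_fn(batch):
        columns = list(batch.keys())
        n_examples = len(batch[columns[0]])
        outputs = [fn({k: batch[k][i] for k in columns}) for i in range(n_examples)]
        return {k: [out[k] for out in outputs] for k in outputs[0]}
    return batched_fn

ds = datasets.Dataset.from_dict({"x": [1, 2, 3]})
ds = ds.with_transform(batchify(lambda example: {"x": example["x"] * 2}))
print(ds[:2])  # {'x': [2, 4]}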
cccntu added the enhancement (New feature or request) label on Dec 6, 2021
@lhoestq (Member) commented Dec 14, 2021

Hi! Thanks for the suggestion :)
It makes sense to me, and it can surely be implemented by wrapping the user's function to make it a batched function. However, I'm not a big fan of the inconsistency it would create with map: with_transform is batched by default while map isn't.

Is there something you would like to contribute? I can give you some pointers if you want.

@cccntu (Contributor, Author) commented Jan 10, 2022

Hi @lhoestq ,
Sorry I missed your reply.

I would love to contribute, but I don't know which solution would be best for this repo.

However I'm not a big fan of the inconsistency it would create with map: with_transform is batched by default while map isn't.

I agree. What do you think about the alternative solutions?

  • Convert a non-batched transform function to a batched one myself.

This won't be able to take advantage of the torch loader's multiple workers.

  • Wrap a 🤗 Dataset with a torch Dataset, and add a __getitem__. 🙄

This is actually pretty simple.

import torch

# Wrap any indexable dataset so that `fn` is applied lazily, at access time.
class LazyMapTorchDataset(torch.utils.data.Dataset):
    def __init__(self, ds, fn):
        self.ds = ds
        self.fn = fn

    def __getitem__(self, i):
        return self.fn(self.ds[i])

d = [{1: 2, 2: 3}, {1: 3, 2: 4}]
ds = LazyMapTorchDataset(d, lambda x: {k: v * 2 for k, v in x.items()})
for i in range(2):
    print(f'before {d[i]}')
    print(f'after {ds[i]}')

This prints:

before {1: 2, 2: 3}
after {1: 4, 2: 6}
before {1: 3, 2: 4}
after {1: 6, 2: 8}

But this requires converting the data to torch tensors myself. And since this is really similar to .map(), why not just use it? Hence the next solution.

  • Have lazy=False in Dataset.map, and return a LazyDataset if lazy=True. This way the same map interface can be used, and existing code can be updated with a one-argument change.

I think I like this solution best: .with_transform is entangled with .with_format, so it seems more flexible to modify .map than to modify .with_transform.

The usage looks nice, too.

# Lazy, one-to-one; can be parallelized via the torch loader, no need to set `num_workers` beforehand.
dataset = dataset.map(fn, lazy=True, batched=False)
# Batching is handled by the collate_fn.
dataloader = DataLoader(dataset.with_format('torch'), collate_fn=collate_fn, num_workers=...)

There are some minor decisions, like whether a lazy map should be allowed before another map, but I think we can work those out later. The implementation can probably borrow from IterableDataset.
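
For illustration only, a rough sketch of what such a LazyDataset might look like (hypothetical names, not actual datasets API):

from torch.utils.data import Dataset as TorchDataset

# Hypothetical sketch of the object Dataset.map(fn, lazy=True, batched=False)
# could return: the transform is deferred to __getitem__, so each DataLoader
# worker applies it in parallel.
class LazyDataset(TorchDataset):
    def __init__(self, dataset, fn):
        self.dataset = dataset  # the underlying 🤗 Dataset
        self.fn = fn            # per-example transform, applied lazily

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return self.fn(self.dataset[i])

    def map(self, fn, lazy=True, **kwargs):
        # A lazy map after a lazy map just composes the two transforms.
        return LazyDataset(self.dataset, lambda ex, f=self.fn: fn(f(ex)))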

@lhoestq (Member) commented Jan 17, 2022

I like the idea of a lazy map. On the other hand, we should only have either lazy map or with_transform (not both). That's why I'd rather stick with with_transform for now (but maybe we can consider it for a later major release like datasets v2).

I understand the issue with with_transform and with_format being exclusive; maybe we can separate them: first transform, then format.
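
(Hypothetically, that could look like ds.with_transform(fn).with_format('torch'), with the format applied on top of the transform's output; this chaining is made up for illustration, not the current API.)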

Finally, I think what's also going to be important in the end is the addition of multiprocessing for transforms.
