None batched with_transform, set_transform #3385

Open
cccntu opened this issue Dec 6, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@cccntu (Contributor) commented Dec 6, 2021

Is your feature request related to a problem? Please describe.

A torch.utils.data.Dataset.__getitem__ operates on a single example.
But 🤗 Datasets' with_transform doesn't seem to allow a non-batched transform.

Describe the solution you'd like

Have a batched=True argument in Dataset.with_transform, so that batched=False allows a per-example transform.

Describe alternatives you've considered

  • Convert a non-batched transform function to a batched one myself (a sketch of this wrapping follows this list).
  • Wrap a 🤗 Dataset with a torch Dataset, and add a __getitem__. 🙄
  • Have lazy=False in Dataset.map, and return a LazyDataset if lazy=True. This way the same map interface can be used, and existing code can be updated with a one-argument change.
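
For reference, a minimal sketch of the first alternative, assuming a per-example transform that maps a dict to a dict (batchify is a hypothetical helper, not part of the library):

import datasets

# Hypothetical helper: turn a per-example transform (dict -> dict) into the
# batched form that with_transform expects (dict of column -> list of values).
def batchify(fn):
    def batched_fn(batch):
        columns = list(batch.keys())
        n_examples = len(batch[columns[0]])
        outputs = [fn({k: batch[k][i] for k in columns}) for i in range(n_examples)]
        return {k: [out[k] for out in outputs] for k in outputs[0]}
    return batched_fn

ds = datasets.Dataset.from_dict({"x": [1, 2, 3]})
ds = ds.with_transform(batchify(lambda example: {"x": example["x"] * 2}))
print(ds[:2])  # {'x': [2, 4]}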
cccntu added the enhancement (New feature or request) label on Dec 6, 2021
@lhoestq (Member) commented Dec 14, 2021

Hi! Thanks for the suggestion :)
It makes sense to me, and it can surely be implemented by wrapping the user's function to make it a batched function. However, I'm not a big fan of the inconsistency it would create with map: with_transform is batched by default while map isn't.

Is there something you would like to contribute? I can give you some pointers if you want.

@cccntu (Contributor, Author) commented Jan 10, 2022

Hi @lhoestq ,
Sorry I missed your reply.

I would love to contribute, but I don't know which solution would be best for this repo.

However I'm not a big fan of the inconsistency it would create with map: with_transform is batched by default while map isn't.

I agree. What do you think about the alternative solutions?

  • Convert a non-batched transform function to a batched one myself.

This won't be able to take advantage of the torch loader's multiple workers.

  • Wrap a 🤗 Dataset with a torch Dataset, and add a __getitem__. 🙄

This is actually pretty simple.

import torch

# Wrap any indexable dataset so that `fn` is applied lazily, at access time.
class LazyMapTorchDataset(torch.utils.data.Dataset):
    def __init__(self, ds, fn):
        self.ds = ds
        self.fn = fn

    def __getitem__(self, i):
        return self.fn(self.ds[i])

d = [{1: 2, 2: 3}, {1: 3, 2: 4}]
ds = LazyMapTorchDataset(d, lambda x: {k: v * 2 for k, v in x.items()})
for i in range(2):
    print(f'before {d[i]}')
    print(f'after {ds[i]}')

This prints:

before {1: 2, 2: 3}
after {1: 4, 2: 6}
before {1: 3, 2: 4}
after {1: 6, 2: 8}

But this requires converting the data to torch tensors myself. And since this is really similar to .map(), why not just use it? Hence the next solution.

  • Have lazy=False in Dataset.map, and return a LazyDataset if lazy=True. This way the same map interface can be used, and existing code can be updated with a one-argument change.

I think I like this solution best: .with_transform is entangled with .with_format, so it seems more flexible to modify .map than to modify .with_transform.

The usage looks nice, too.

# Lazy, one-to-one; can be parallelized via the torch loader, no need to set `num_workers` beforehand.
dataset = dataset.map(fn, lazy=True, batched=False)
# Batching is handled by the collate_fn.
dataloader = DataLoader(dataset.with_format('torch'), collate_fn=collate_fn, num_workers=...)

There are some minor decisions, like whether a lazy map should be allowed before another map, but I think we can work those out later. The implementation can probably borrow from IterableDataset.
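
For illustration only, a rough sketch of what such a LazyDataset might look like (hypothetical names, not actual datasets API):

from torch.utils.data import Dataset as TorchDataset

# Hypothetical sketch of the object Dataset.map(fn, lazy=True, batched=False)
# could return: the transform is deferred to __getitem__, so each DataLoader
# worker applies it in parallel.
class LazyDataset(TorchDataset):
    def __init__(self, dataset, fn):
        self.dataset = dataset  # the underlying 🤗 Dataset
        self.fn = fn            # per-example transform, applied lazily

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return self.fn(self.dataset[i])

    def map(self, fn, lazy=True, **kwargs):
        # A lazy map after a lazy map just composes the two transforms.
        return LazyDataset(self.dataset, lambda ex, f=self.fn: fn(f(ex)))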

@lhoestq (Member) commented Jan 17, 2022

I like the idea of a lazy map. On the other hand, we should only have either lazy map or with_transform (not both). That's why I'd rather stick with with_transform for now (but maybe we can consider it for a later major release like datasets v2).

I understand the issue with with_transform and with_format being exclusive; maybe we can separate them: first transform, then format.
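
(Hypothetically, that could look like ds.with_transform(fn).with_format('torch'), with the format applied on top of the transform's output; this chaining is made up for illustration, not the current API.)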

Finally, I think what's also going to be important in the end is the addition of multiprocessing for transforms.
