None batched `with_transform`, `set_transform` #3385

Comments
Hi! Thanks for the suggestion :) Is there something you would like to contribute? I can give you some pointers if you want.

Hi @lhoestq, I would love to contribute, but I don't know which solution would be best for this repo.
I agree. What do you think about the alternative solutions?
This won't be able to use the torch loader's multi-worker loading.
This is actually pretty simple:

```python
import torch

class LazyMapTorchDataset(torch.utils.data.Dataset):
    def __init__(self, ds, fn):
        self.ds = ds
        self.fn = fn

    def __len__(self):
        return len(self.ds)  # needed so a DataLoader sampler knows the size

    def __getitem__(self, i):
        return self.fn(self.ds[i])  # apply the transform lazily, per example

d = [{1: 2, 2: 3}, {1: 3, 2: 4}]
ds = LazyMapTorchDataset(d, lambda x: {k: v * 2 for k, v in x.items()})
for i in range(2):
    print(f'before {d[i]}')
    print(f'after  {ds[i]}')
```
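For reference, a rough sketch of how a wrapper like the one above could feed a multi-worker DataLoader; the `to_tensors` transform here is hypothetical and just illustrates the manual tensor conversion step:

```python
from torch.utils.data import DataLoader

# hypothetical per-example transform: convert plain values to tensors by hand
def to_tensors(example):
    return {k: torch.tensor(v) for k, v in example.items()}

wrapped = LazyMapTorchDataset(d, to_tensors)

# the transform runs inside __getitem__, so DataLoader workers parallelize it
# (on spawn-based platforms this needs the usual `if __name__ == "__main__":` guard)
loader = DataLoader(wrapped, batch_size=2, num_workers=2)
for batch in loader:
    print(batch)
```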
But this requires converting the data to torch tensors myself. And this is really similar to
I think I like this solution best, because the usage looks nice, too:

```python
# lazy, one-to-one, can be parallelized via the torch loader, no need to set `num_workers` beforehand
dataset = dataset.map(fn, lazy=True, batched=False)

# collate_fn
dataloader = DataLoader(dataset.with_format('torch'), collate_fn=collate_fn, num_workers=...)
```

There are some minor decisions, like whether a lazy map should be allowed before another map, but I think we can work those out later. The implementation can probably borrow from
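To make the lazy-map idea concrete, here is a purely illustrative sketch, not an actual `datasets` API, of how an object returned by `map(fn, lazy=True, batched=False)` could defer and chain per-example transforms:

```python
class LazyDataset:
    """Hypothetical wrapper: stores transforms and applies them on access."""
    def __init__(self, dataset, fns):
        self.dataset = dataset
        self.fns = fns  # per-example transforms, applied in order

    def map(self, fn, lazy=True, batched=False):
        # a second lazy map just extends the pipeline instead of materializing
        return LazyDataset(self.dataset, self.fns + [fn])

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        example = self.dataset[i]
        for fn in self.fns:
            example = fn(example)
        return example
```

A DataLoader worker would then run the whole pipeline inside `__getitem__`, which is where the multi-worker parallelism would come from.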
I like the idea of lazy map. On the other hand, we should only have either lazy map or `with_transform`.
I understand the issue with
Finally, I think what's also going to be important in the end will be the addition of multiprocessing to transforms.
Is your feature request related to a problem? Please describe.

A `torch.utils.data.Dataset.__getitem__` operates on a single example, but 🤗 `Datasets.with_transform` doesn't seem to allow a non-batched transform.

Describe the solution you'd like

Have a `batched=True` argument in `Datasets.with_transform`.
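For context, a minimal sketch of how the existing batched on-the-fly transform behaves today; the final commented-out call with `batched=False` is the requested feature, not an existing argument:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "bar"]})

# today: the transform receives a batch (a dict of lists), even when indexing one row
def upper_batch(batch):
    return {"text": [t.upper() for t in batch["text"]]}

print(ds.with_transform(upper_batch)[0])  # applied on the fly at access time

# requested: the transform receives a single example
def upper_example(example):
    return {"text": example["text"].upper()}

# ds.with_transform(upper_example, batched=False)  # hypothetical, not implemented
```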
Describe alternatives you've considered

- Convert the 🤗 dataset to a torch dataset and apply the transform in `__getitem__`. 🙄
- Have `lazy=False` in `Dataset.map`, and return a `LazyDataset` if `lazy=True`. This way the same `map` interface can be used, and existing code can be updated with one argument change.