How to convert torch.utils.data.Dataset to huggingface dataset? #4983
Comments
Hi! I think you can use the newly-added Dataset.from_generator for that:
from datasets import Dataset
def gen():
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # this has to be a dictionary
    ## or if it's an IterableDataset
    # for ex in torch_dataset:
    #     yield ex
dset = Dataset.from_generator(gen) |
Maybe Dataset.from_list would work too:
from datasets import Dataset
dset = Dataset.from_list(torch_dataset) |
I tried to use Dataset.from_generator, but I got this error: AttributeError: type object 'Dataset' has no attribute 'from_generator'. I thought my datasets package might be out of date, so I upgraded it with pip install --upgrade datasets. But after that, the code still raises the same error. |
It seems that Dataset also has no from_list: AttributeError: type object 'Dataset' has no attribute 'from_list' |
My dummy code is like:
import os
import json
from torch.utils import data
import datasets
def gen(torch_dataset):
    for idx in range(len(torch_dataset)):
        yield torch_dataset[idx]  # this has to be a dictionary
class MyDataset(data.Dataset):
    def __init__(self, path):
        self.dict = []
        for line in open(path, 'r', encoding='utf-8'):
            j_dict = json.loads(line)
            self.dict.append(j_dict['context'])
    def __getitem__(self, idx):
        return self.dict[idx]
    def __len__(self):
        return len(self.dict)
root_path = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(root_path, 'dataset', 'train.json')
torch_dataset = MyDataset(path)
dit = []
for line in open(path, 'r', encoding='utf-8'):
    j_dict = json.loads(line)
    dit.append(j_dict['context'])
dset1 = datasets.Dataset.from_list(dit)
print(dset1)
# gen takes the dataset as an argument, so it is passed through gen_kwargs
dset2 = datasets.Dataset.from_generator(gen, gen_kwargs={"torch_dataset": torch_dataset})
print(dset2) |
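Both Dataset.from_list and Dataset.from_generator expect every example to be a dictionary, while the __getitem__ in the dummy code returns the bare 'context' value. Below is a minimal sketch of one way to bridge that. The helper name as_examples, the column name "context" and the ToyDataset stand-in are assumptions made for illustration; they come from neither the thread nor the datasets library.

```python
def as_examples(torch_dataset, column="context"):
    """Yield one dict per item, wrapping bare values under `column` (hypothetical helper)."""
    for idx in range(len(torch_dataset)):
        item = torch_dataset[idx]
        # Items that are already dictionaries pass through unchanged.
        yield item if isinstance(item, dict) else {column: item}


class ToyDataset:
    """Stand-in for a map-style torch.utils.data.Dataset (only __len__ and __getitem__)."""

    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, idx):
        return self.rows[idx]

    def __len__(self):
        return len(self.rows)


def to_hf_dataset(torch_dataset):
    # Imported lazily so the generator can be tried without `datasets` installed.
    from datasets import Dataset

    return Dataset.from_generator(as_examples, gen_kwargs={"torch_dataset": torch_dataset})


# Inspect what the generator would feed to Dataset.from_generator.
examples = list(as_examples(ToyDataset(["first text", "second text"])))
print(examples)
```

Calling to_hf_dataset(torch_dataset) needs a datasets release that already ships from_generator.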
We're releasing |
Thanks a lot for your work! |
Hi, when I am using this code to build my own dataset, |
Hi! Right now generator functions are expected to be picklable. In the meantime, can you check that you're not using unpicklable objects? In your case it looks like you're using a generator object that is unpicklable. It might come from an opened file, e.g. this doesn't work:
with open(...) as f:
    def gen():
        for x in f:
            yield json.loads(x)
    ds = Dataset.from_generator(gen)
but this does work:
def gen():
    with open(...) as f:
        for x in f:
            yield json.loads(x)
ds = Dataset.from_generator(gen) |
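The working pattern can be tried out without datasets installed. The sketch below passes only the file path to the generator function, through gen_kwargs, so nothing unpicklable is captured. The names read_jsonl and build_dataset, and the sample file contents, are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    # The file is opened inside the generator function, so the function itself
    # holds no open handle; it only receives a path string.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def build_dataset(path):
    # Lazy import; this part requires the `datasets` package.
    from datasets import Dataset

    return Dataset.from_generator(read_jsonl, gen_kwargs={"path": path})


# Exercise the generator on a throwaway JSON Lines file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "train.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"context": "a"}\n{"context": "b"}\n')
    rows = list(read_jsonl(sample_path))
print(rows)
```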
Thanks a lot! That was the reason I ran into this issue. Sorry to bother you with another problem: since my dataset is large, I use IterableDataset.from_generator, which has no with_transform attribute. How can I equip it with some custom preprocessing, as I would with Dataset.from_generator? Should I move the preprocessing into my torch Dataset? |
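Iterable datasets are lazy: exactly like with_transform, they apply processing on the fly as you iterate over the dataset. Therefore you can use map for your preprocessing. |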
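For a streaming setup like the one asked about, here is a minimal sketch. It relies on IterableDataset.map applying the function lazily, only as examples are read. The toy generator, the preprocess function and the build_stream helper are made up for illustration.

```python
def gen():
    # Hypothetical toy stream; in practice this would iterate over the torch dataset.
    for i in range(3):
        yield {"text": f"Sample {i}"}


def preprocess(example):
    # Per-example transformation, executed only when the example is read.
    example["text"] = example["text"].lower()
    return example


def build_stream():
    # Lazy import; this part requires the `datasets` package.
    from datasets import IterableDataset

    return IterableDataset.from_generator(gen).map(preprocess)


# Plain-Python preview of what iterating over the stream should yield.
preview = [preprocess(ex) for ex in gen()]
print(preview)
```

With this approach the preprocessing does not have to move into the torch Dataset, and nothing is computed until the stream is consumed.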
@lhoestq thanks a lot and I have successfully made it work~ |
@lhoestq I am having a similar issue. Can you help me understand which kinds of generators are picklable? I previously thought that no generators are picklable so I'm intrigued to hear this. |
Generator functions are generally picklable. E.g.
import dill as pickle
def generator_fn():
    for i in range(10):
        yield i
pickle.dumps(generator_fn)
However, generators are not picklable:
generator = generator_fn()
pickle.dumps(generator)
# TypeError: cannot pickle 'generator' object
Though it can happen that some generator functions are not recursively picklable if they use global objects that are not picklable:
def generator_fn_not_picklable():
    for i in generator:
        yield i
pickle.dumps(generator_fn_not_picklable, recurse=True)
# TypeError: cannot pickle 'generator' object |
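When it is unclear which object is the culprit, a rough check like the one below can narrow it down. It is only a sketch: the helper find_unpicklable is hypothetical, and it uses the standard library pickle rather than dill, so it is stricter and may flag objects (lambdas, for example) that dill can serialize.

```python
import pickle


def find_unpicklable(gen_kwargs):
    """Return the keys whose values the standard `pickle` module rejects."""
    bad = []
    for key, value in gen_kwargs.items():
        try:
            pickle.dumps(value)
        except Exception:  # e.g. TypeError or pickle.PicklingError
            bad.append(key)
    return bad


stream = (i for i in range(3))  # a generator object, not a generator function
report = find_unpicklable({"rows": [1, 2, 3], "stream": stream})
print(report)
```

Replacing a flagged value with something that can recreate it inside the generator function, such as a file path instead of an open file, is usually enough.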
I'm trying to create an IterableDataset from a generator but I get this error: What can I do? |
I looked through the huggingface dataset docs, and it seems that there is no official function to convert torch.utils.data.Dataset to a huggingface dataset. However, there is a way to convert a huggingface dataset to torch.utils.data.Dataset.
So is there something I missed, or is there really no function to convert torch.utils.data.Dataset to a huggingface dataset? If so, is there any way to do this conversion?
Thanks.