
IterableDataset.from_generator() fails with pickle error when provided a generator or iterator #6118

Open
finkga opened this issue Aug 4, 2023 · 3 comments


@finkga

finkga commented Aug 4, 2023

Describe the bug

Description
Providing a generator in an instantiation of IterableDataset.from_generator() fails with TypeError: cannot pickle 'generator' object when the generator argument is supplied with a generator.

Code example

from pathlib import Path
from typing import List

from datasets import IterableDataset


def line_generator(files: List[Path]):
    if isinstance(files, str):
        files = [Path(files)]

    for file in files:
        if isinstance(file, str):
            file = Path(file)
        yield from open(file, 'r').readlines()

...
model_training_files = ['file1.txt', 'file2.txt', 'file3.txt']
train_dataset = IterableDataset.from_generator(generator=line_generator(model_training_files))

Traceback
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 691, in _no_cache_fields
    yield
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 701, in dumps
    dump(obj, file)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 676, in dump
    Pickler(file, recurse=True).dump(obj)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/dill/_dill.py", line 394, in dump
    StockPickler.dump(self, obj)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/pickle.py", line 487, in dump
    self.save(obj)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 666, in save
    dill.Pickler.save(self, obj, save_persistent_id=save_persistent_id)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/dill/_dill.py", line 388, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/dill/_dill.py", line 1186, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 666, in save
    dill.Pickler.save(self, obj, save_persistent_id=save_persistent_id)
  File "/Users/d3p692/code/clem_bert/venv/lib/python3.9/site-packages/dill/_dill.py", line 388, in save
    StockPickler.save(self, obj, save_persistent_id)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle 'generator' object
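The error at the bottom of the traceback is not specific to datasets: CPython cannot serialize a live generator object at all, because it carries a suspended stack frame. A minimal stdlib sketch (the function name here is illustrative):

```python
import pickle

def line_gen():
    # trivial stand-in for line_generator
    yield "line 1"
    yield "line 2"

# the generator *function* pickles fine (by qualified-name reference)
pickle.dumps(line_gen)

# the generator *object* does not: it holds a paused frame
gen = line_gen()
try:
    pickle.dumps(gen)
except TypeError as e:
    print(e)  # cannot pickle 'generator' object
```

This is why the API has to take the function rather than the result of calling it.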

Steps to reproduce the bug

  1. Create a set of text files to iterate over.
  2. Create a generator that returns the lines in each file until all files are exhausted.
  3. Instantiate the dataset over the generator by instantiating an IterableDataset.from_generator().
  4. Wait for the explosion.

Expected behavior

Since the function claims to accept a generator, I would expect no crash. Instead, I would expect the dataset to return all the lines in the files, as queued up by the line_generator() function.

Environment info

datasets.__version__ == '2.13.1'
Python 3.9.6
Platform: Darwin WE35261 22.5.0 Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:22 PDT 2023; root:xnu-8796.121.3~7/RELEASE_X86_64 x86_64

@mariosasko
Collaborator

Hi! IterableDataset.from_generator expects a generator function, not the object (to be consistent with Dataset.from_generator).

You can fix the above snippet as follows:

train_dataset = IterableDataset.from_generator(line_generator, fn_kwargs={"files": model_training_files})

@nimrodV81

To anyone reaching this issue: the keyword argument is gen_kwargs, not fn_kwargs:

train_dataset = IterableDataset.from_generator(line_generator, gen_kwargs={"files": model_training_files})
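Beyond picklability, there is a second reason from_generator wants the callable plus gen_kwargs rather than an already-started generator: the function can be re-invoked to produce a fresh iterator, whereas a generator object is exhausted after one pass. A self-contained sketch of that difference, using temp files in place of file1.txt etc. (no datasets dependency):

```python
import tempfile
from pathlib import Path

def line_generator(files):
    # same shape as the reporter's function, simplified
    for file in files:
        with open(file) as f:
            yield from f

# two throwaway files standing in for file1.txt / file2.txt
tmp = Path(tempfile.mkdtemp())
(tmp / "file1.txt").write_text("a\nb\n")
(tmp / "file2.txt").write_text("c\n")
files = [tmp / "file1.txt", tmp / "file2.txt"]

# a callable + kwargs can be called again: fresh iteration every time
first = list(line_generator(files))
second = list(line_generator(files))
assert first == second == ["a\n", "b\n", "c\n"]

# a generator object is single-use: the second pass is empty
gen = line_generator(files)
list(gen)                  # exhausts it
assert list(gen) == []
```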

@PheelaV

PheelaV commented Dec 18, 2024

This still fails, for both Dataset and IterableDataset

records = [1, 2, 3]

gen = ({"row": str(x)} for x in records)

dataset = IterableDataset.from_generator(generator=gen)

Edit: gen_kwargs must be picklable; it can't contain an iterator even if you are not using multiprocessing, and the same goes for any namespace variables the generator function captures.
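That constraint can be checked directly with the stdlib: as the traceback above shows, datasets pickles the generator function together with its kwargs (via dill), so everything passed in gen_kwargs, or captured from the enclosing namespace, must survive pickle. A quick check, with names chosen for illustration:

```python
import pickle

records = [1, 2, 3]

# plain data pickles, so it is fine as a gen_kwargs value
pickle.dumps({"records": records})

# a generator expression does not, so it cannot appear in gen_kwargs
# (or be closed over by the generator function)
rows = ({"row": str(x)} for x in records)
try:
    pickle.dumps({"records": rows})
    ok = True
except TypeError:
    ok = False
assert not ok
```

The fix is the same as above: keep gen_kwargs to plain data (lists, paths, strings) and let the generator function build any iterators internally.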
