
TypeError: cannot pickle '_LazyModule' object #12549

Closed · 2 tasks
lancekung opened this issue Jul 7, 2021 · 15 comments

Comments

lancekung commented Jul 7, 2021

@stas00 edit: please see #12549 (comment) for the short reproduction script.


Environment info

  • transformers version: 4.9.0.dev0
  • Platform: Linux with Nvidia P40
  • Python version: 3.8.0
  • PyTorch version (GPU?): 1.8.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@stas00, @patrickvonplaten, @LysandreJik

Information

Model I am using (Bert, XLNet ...): GPT2

The problem arises when using:

  • the official example scripts: (give details below)
  • [√] my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • [√] my own task or dataset: (give details below)

To reproduce

I am running the minimal command:

python run_clm.py \
    --model_name_or_path /mycheckpoin/ \
    --train_file train.txt \
    --validation_file  eval.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/ \
    --no_cuda False \
    --fp16 \
    --sharded_ddp simple \
    --num_train_epochs 3.0 \
    --disable_tqdm False \
    --save_steps 100 \
    --preprocessing_num_workers 32 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4

and I modified the following parts of the script 'run_clm.py'; the parameter rank is passed in as training_args.local_rank:

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    # main()
    # size = int(os.environ['WORLD_SIZE'])
    size = int(torch.cuda.device_count())
    print(size)
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, main))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

The traceback is:

Process Process-2:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 511, in init_process
    fn(rank, size)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 367, in main
    tokenized_datasets = raw_datasets.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 471, in map
    {
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 472, in <dictcomp>
    k: dataset.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in map
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 692, in save_reduce
    save(args)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle '_LazyModule' object

Running the following command based on the original script works well. The reason I don't use it is that our cluster doesn't support this way of passing parameters: "-m torch.distributed.launch --nproc_per_node=4"

python -m torch.distributed.launch --nproc_per_node=4 run_clm.py \
    --model_name_or_path /mycheckpoin/ \
    --train_file train.txt \
    --validation_file  eval.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/ \
    --no_cuda False \
    --fp16 \
    --sharded_ddp simple \
    --num_train_epochs 3.0 \
    --disable_tqdm False \
    --save_steps 100 \
    --preprocessing_num_workers 32 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4

Expected behavior

stas00 (Contributor) commented Jul 7, 2021

Could you please attach the final script you used or a branch that we can use to reproduce your code exactly? Thanks.

note: I took the liberty to edit your OP to use code formatting which is much easier to read. If possible use a similar approach in future reports. Thank you!

lancekung (Author) replied:

> Could you please attach the final script you used or a branch that we can use to reproduce your code exactly? Thanks. [...]

Here is my script, thanks very much!
run_clm.py.zip

stas00 (Contributor) commented Jul 7, 2021

Thank you. The attached script fails for me. You also didn't supply the data, but I assume that doesn't matter. In the future, please supply everything, or adapt your setup so that we can run it out of the box without spending a lot of time making things work.

python run_clm.py \
>     --model_name_or_path gpt2 \
>     --dataset_name wikitext \
>     --dataset_config_name wikitext-2-raw-v1 \
>     --do_train \
>     --do_eval \
>     --output_dir /tmp/test-clm
2021-07-06 21:18:15.064178: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2
2021-07-06 21:18:17.425481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-06 21:18:17.425484: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Process Process-1:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
TypeError: init_process() missing 1 required positional argument: 'fn'
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
TypeError: init_process() missing 1 required positional argument: 'fn'

The same failure occurs with distributed.

lancekung (Author) replied:

> Thank you. The attached script fails for me. [...]

So sorry, it's my fault: I gave you the wrong version. This is the right one:
run_clm.py.zip

stas00 (Contributor) commented Jul 7, 2021

I'm able to reproduce the problem - great!

Let's see what the culprit is.

stas00 (Contributor) commented Jul 7, 2021

So the trigger is: --preprocessing_num_workers 32

and the minimal reproduction cmd is:

python run_clm.py --model_name_or_path sshleifer/tiny-gpt2 --dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir /tmp/test-clm \
--overwrite_output_dir --preprocessing_num_workers 32

It happens only with your version of the script; I tested with the one in master and it works fine there.

The problem is unrelated to the change in #11168 as you have discovered yourself, since your code removed my changes and you're just passing:

    def tokenize_function1(examples):
        return tokenizer(examples[text_column_name])

So we need to look elsewhere for the cause.

stas00 (Contributor) commented Jul 7, 2021

From a quick look, I suspect this may be an issue in datasets when num_proc > 1. Could you try to reduce the script to the bare minimum, so that it runs just:

    with training_args.main_process_first(desc="dataset map tokenization"):
        tokenized_datasets = raw_datasets.map(
            None,
            num_proc=5,
        )

inside the multi-proc modifications you made.

E.g. the above is enough to trigger the same error in the script, so removing most of the code should still reproduce it.

stas00 (Contributor) commented Jul 7, 2021

OK, here is the minimal reproducible script. It seems totally unrelated to transformers, except for the import of transformers itself:

import logging
import math
import os
import sys
from dataclasses import dataclass, field
from typing import Optional

import datasets
from datasets import load_dataset

import transformers

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main(rank, size):

    def tokenize_function(examples):
        return None

    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        num_proc=32,
    )

def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    # main()
    # size = int(os.environ['WORLD_SIZE'])
    size = int(torch.cuda.device_count())
    print(size)
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, main))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

This still fails with the same error:

python run_clm.py
2
Reusing dataset wikitext (/home/stas/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Reusing dataset wikitext (/home/stas/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
Process Process-1:
Process Process-2:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme1/code/huggingface/users/lancekung/run_clm.py", line 60, in init_process
    fn(rank, size)
  File "/mnt/nvme1/code/huggingface/users/lancekung/run_clm.py", line 46, in main
    tokenized_datasets = raw_datasets.map(
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/dataset_dict.py", line 471, in map
    {
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/dataset_dict.py", line 472, in <dictcomp>
    k: dataset.map(
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 1736, in map
    transformed_shards = [r.get() for r in results]
  File "/mnt/nvme1/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 1736, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 692, in save_reduce
    save(args)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/pickle.py", line 578, in save
    rv = reduce(self.proto)
TypeError: cannot pickle '_LazyModule' object
(The second process fails with an identical traceback, ending in the same TypeError: cannot pickle '_LazyModule' object.)

But if you either:

  • comment out import transformers, or
  • set num_proc=1 in datasets.map (instead of n>1),

then all is good.

@lhoestq, @albertvillanova - does this ring any bells? Clearly transformers loads some module lazily and trips up datasets even though transformers isn't really used here directly. Thank you.

lancekung (Author) replied:

> OK, here is the minimal reproducible script. [...]

Thank you so much for your time; I hope other experts can give some tips about this problem.

albertvillanova (Member) commented Jul 7, 2021

Hi @stas00, thanks for pinging.

I'm having a look, and after a first search I think you are right: the problem comes from the fact that transformers performs lazy imports. I guess this affects datasets here: https://github.com/huggingface/datasets/blob/master/src/datasets/utils/py_utils.py#L319 (PR: huggingface/datasets#502), which is used by dumps to pickle objects in a multiprocessing setup.

cc: @lhoestq
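
If that's the mechanism, it should be reproducible with dill alone, without datasets or multiprocessing. Here is a speculative minimal sketch (my assumption, not something posted in this thread): with a transformers version from before the fix, pickling a function defined in __main__ makes dill serialize the function's __globals__, and the lazily-initialized transformers module object sitting in those globals goes along for the ride.

# Speculative repro with dill alone (assumes dill is installed and a
# transformers version from before the _LazyModule pickling fix).
import dill

import transformers  # leaves a lazily-initialized _LazyModule in our globals


def tokenize_function(examples):
    return None


# Functions defined in __main__ are pickled by value; with recurse off, dill
# also pickles the function's __globals__ dict, where the _LazyModule lives.
dill.dumps(tokenize_function)  # expected: TypeError: cannot pickle '_LazyModule' object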

lancekung (Author) commented Jul 7, 2021

> Hi @stas00, thanks for pinging. I'm having a look, and after a first search I think you are right: the problem comes from the fact that transformers performs lazy imports. [...]

Hi @albertvillanova, I removed the import of transformers based on the following code, but it still doesn't work:

def _no_cache_fields(obj):
    try:
        if (
            "PreTrainedTokenizerBase" in [base_class.__name__ for base_class in type(obj).__mro__]
            and hasattr(obj, "cache")
            and isinstance(obj.cache, dict)
        )

lhoestq (Member) commented Jul 7, 2021

Note that we can easily make _LazyModule picklable. I can open a PR if needed to implement a __reduce__ method for _LazyModule. It's the only object that prevents transformers from being picklable.

EDIT: here it is: #12552

This is just a way to easily fix this issue, but I think we should definitely keep trying to figure out why it tried to pickle transformers in the first place. This might come from dill, which pickles the globals of some environments when pickling any object.
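
For reference, here is a minimal sketch of the kind of __reduce__ lhoestq mentions. This is an illustration under my own assumptions, not the actual code of #12552: the idea is to pickle the lazy module as "re-import me by name" instead of trying to serialize its internal state.

import importlib
from types import ModuleType


class _LazyModule(ModuleType):
    # Illustrative stand-in for transformers' lazy module wrapper, assuming
    # the module can simply be re-imported by name when unpickling.

    def __reduce__(self):
        # Serialize only the module name; unpickling calls
        # importlib.import_module(name) and gets a fresh module object.
        return (importlib.import_module, (self.__name__,))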

stas00 (Contributor) commented Jul 7, 2021

Linking to the new PR: #12567

sgugger (Collaborator) commented Jul 8, 2021

Should be closed by #12567, please let us know if the problem persists.

sgugger closed this as completed Jul 8, 2021

lancekung (Author) replied:

> Should be closed by #12567, please let us know if the problem persists.

Hi, a new problem has arisen: we can pickle '_LazyModule' now, but we can't pickle <class 'types.AutoModelForCausalLM'>:

Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 509, in init_process
    fn(rank, size)
  File "/media/cfs/gonglixing/9Nctl/gpt_v2/run_clm_v3.py", line 367, in main
    tokenized_datasets = raw_datasets.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 471, in map
    {
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/dataset_dict.py", line 472, in <dictcomp>
    k: dataset.map(
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in map
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1736, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
    put(task)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 498, in dump
    StockPickler.dump(self, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 487, in dump
    self.save(obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 1493, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 692, in save_reduce
    save(args)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 901, in save_tuple
    save(element)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/site-packages/dill/_dill.py", line 1439, in save_type
    StockPickler.save_global(pickler, obj, name=name)
  File "/usr/local/anaconda3/envs/py38/lib/python3.8/pickle.py", line 1070, in save_global
    raise PicklingError(
_pickle.PicklingError: Can't pickle <class 'types.AutoModelForCausalLM'>: it's not found as types.AutoModelForCausalLM
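
For what it's worth, this second error can be reproduced outside transformers: a class created dynamically (e.g. with types.new_class, which is roughly how the auto classes were generated at the time) records 'types' as its __module__, so pickle's by-reference lookup fails. A small self-contained illustration (the names here are mine, not from the library):

import pickle
import types

# type() records the caller's module as __module__; for types.new_class the
# caller is types.py itself, so the class claims to live in the 'types' module.
Dyn = types.new_class("Dyn")
print(Dyn.__module__)  # -> 'types'

try:
    pickle.dumps(Dyn)
except pickle.PicklingError as e:
    print(e)  # Can't pickle <class 'types.Dyn'>: it's not found as types.Dyn

# One possible fix: make the class discoverable where it is actually defined.
Dyn.__module__ = __name__
pickle.dumps(Dyn)  # now succeeds, since pickle finds it as <this module>.Dyn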
