Hi!
Recently, we have been training on a large-scale dataset (> 50M audio segments) on the K2 platform.
To reduce IO operations through NFS, we decided to use the Lhotse Shar format.
I followed the instructions from https://github.com/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb
It worked well when I used one GPU with one or multiple workers.
However, I got an unexpected error when I trained the model using multiple GPUs.
Here is the error information:
Traceback (most recent call last):
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1899, in <module>
    main()
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1890, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1727, in run
    train_one_epoch(
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1282, in train_one_epoch
    for batch_idx, batch in enumerate(train_dl):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object
It looks like a generator object from train_dl is being passed to the DDP worker processes.
I then tried to debug it and found that the only way to make it work was to set num_workers=0.
I am using the Shar format with the fields "cuts" and "features". Number of GPUs: 4.
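For reference, our data-loading code roughly follows the Shar tutorial. Below is a minimal sketch of that kind of setup; the shard file names, sampler, batching parameters, and dataset class are placeholders rather than our exact code:

```python
from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

# Open the Shar shards lazily; "cuts" holds the manifests and "features"
# the precomputed feature arrays (the shard file names are placeholders).
cuts = CutSet.from_shar(
    fields={
        "cuts": ["shar/cuts.000000.jsonl.gz", "shar/cuts.000001.jsonl.gz"],
        "features": ["shar/feats.000000.tar", "shar/feats.000001.tar"],
    }
)

# Batch by total duration; the exact sampler and limits are illustrative.
sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
dataset = K2SpeechRecognitionDataset()

# With num_workers > 0, the dataset/sampler state is pickled into the
# worker processes, which is where "cannot pickle 'generator' object"
# shows up for us.
train_dl = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)
```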
Has anyone had similar issues?
You might just get lucky if you set lhotse.set_dill_enabled(True) somewhere in your code (or LHOTSE_DILL_ENABLED=1). Otherwise you'll have to try and debug where this generator object is created. I don't think I've ever run into this issue with Lhotse, so I suspect it may be somewhere in the user code (typically some .map or .filter method called on Lhotse objects with a lambda function, etc.).
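For example, a minimal sketch of where such a call could go; note that it only takes effect if the dill package is installed:

```python
import os

# Option 1: enable dill-based pickling in code, before the DataLoader /
# DDP worker processes are created (requires the dill package).
import lhotse

lhotse.set_dill_enabled(True)

# Option 2: set the environment variable instead, e.g. in the launch script.
os.environ["LHOTSE_DILL_ENABLED"] = "1"
```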
Thank you very much for your reply!
I saw in the Lhotse codebase that calling set_dill_enabled(True) sets the environment variable LHOTSE_DILL_ENABLED=1 and also checks whether the dill package is installed. I have not installed the dill package.
So I tried printing this variable during training, but the output was None, so I believe the error is caused by something else.
I will try to debug further.
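In case it helps, here is a rough sketch of the kind of check I plan to use to narrow down which attribute fails to pickle; the objects probed are just illustrative (train_dl is the DataLoader from the training script):

```python
import pickle


def find_unpicklable(obj, name="obj"):
    """Try to pickle each attribute of `obj` and report the ones that fail."""
    for attr, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:
            print(f"{name}.{attr}: {type(value).__name__} -> {exc}")


# Probe the objects that get pickled into the DataLoader worker processes.
find_unpicklable(train_dl.dataset, "dataset")
find_unpicklable(train_dl.sampler, "sampler")
```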