Error when using the Lhotse Shar data format for multi-GPU K2 model training #1408

Open
pengyizhou opened this issue Oct 22, 2024 · 2 comments

@pengyizhou

Hi!
Recently, we have been training on a large-scale dataset (>50M audio segments) on the K2 platform.
To reduce I/O operations over NFS, we decided to use the Lhotse Shar format.

I followed the instructions from https://github.com/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb
It worked well when I used one GPU with one or multiple workers.
However, I got an unexpected error when I trained the model on multiple GPUs.

Here is the error information:

Traceback (most recent call last):
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1899, in <module>
    main()
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1890, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1727, in run
    train_one_epoch(
  File "/home/asrxiv/w2023/projects/zipformer/zipformer/finetune-shar.py", line 1282, in train_one_epoch
    for batch_idx, batch in enumerate(train_dl):
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/asrxiv/anaconda3/envs/icefall/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'generator' object

It looks like a generator from train_dl is being pickled when the DataLoader spawns its worker processes under DDP.
I tried to debug it and found that the only way to make it work was to set num_workers=0.

I am using the Shar format with the fields "cuts" and "features", and 4 GPUs.
Has anyone had similar issues?
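
For context, the data pipeline follows the pattern from the Shar tutorial notebook; roughly like the minimal sketch below (paths, max_duration, and num_workers are illustrative, not the exact finetune-shar.py code):

```python
from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# Lazily read Shar shards; the CutSet is an iterator, not an in-memory list.
cuts = CutSet.from_shar(
    fields={
        "cuts": ["data/shar/cuts.000000.jsonl.gz"],     # illustrative paths
        "features": ["data/shar/features.000000.tar"],
    },
    shuffle_shards=True,
)

sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
dataset = K2SpeechRecognitionDataset()

# With Shar, the sampler (and the lazy CutSet inside it) lives in the
# dataloading workers, so this whole object graph must be picklable
# whenever num_workers > 0.
train_dl = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=2,
)
```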

@pzelasko
Collaborator

pzelasko commented Oct 22, 2024

You might just get lucky if you set lhotse.set_dill_enabled(True) somewhere in your code (or LHOTSE_DILL_ENABLED=1). Otherwise you'll have to try and debug where this generator object is created. I don't think I've ever run into this issue with Lhotse, so I suspect it may be somewhere in the user code (typically some .map or .filter method called on Lhotse objects with a lambda function, etc.).
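
For example (a minimal sketch; the commented-out filter is just an illustration of the kind of user code that breaks the default pickler):

```python
import lhotse

# Enable dill-based pickling of Lhotse objects; requires `pip install dill`.
# Must happen before the DataLoader spawns its workers.
lhotse.set_dill_enabled(True)

# The kind of user code that trips the default pickler: a lazy CutSet
# holding a closure. The standard pickle module cannot serialize lambdas,
# while dill usually can.
# cuts = cuts.filter(lambda c: c.duration <= 30.0)
```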

I used this snippet in the past to find the unpicklable objects: https://gist.github.com/pzelasko/90c1c13acd86f6c9c0aa4a3fa69dadba
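
The gist has the full version; the core idea is roughly this (a condensed sketch, not the gist verbatim):

```python
import pickle

def find_unpicklable(obj, path="obj", seen=None):
    """Try to pickle obj; on failure, descend into its attributes
    to locate the offending member."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
    except Exception as e:
        print(f"{path}: {type(obj).__name__} fails to pickle ({e})")
        for name, attr in getattr(obj, "__dict__", {}).items():
            find_unpicklable(attr, f"{path}.{name}", seen)

# e.g. find_unpicklable(train_dl) right before the line that crashes
```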

@pengyizhou
Author

Thank you very much for your reply!
I saw in the Lhotse codebase that calling set_dill_enabled(True) sets the environment variable LHOTSE_DILL_ENABLED=1 and also checks whether the dill package is installed; I have not installed the dill package.
So I tried to print this variable during training, but the output was None, so I believe the error is caused by something else.
I will try to debug further.
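
In case it helps others, the check could look roughly like this inside the per-GPU entry point (run() here stands in for the function passed to mp.spawn; this is an assumption about the script structure, not the actual finetune-shar.py code):

```python
import os
import lhotse

def run(rank, world_size, args):
    # Call this in every spawned process (or in the parent before mp.spawn,
    # since the env var is inherited); requires `pip install dill`.
    lhotse.set_dill_enabled(True)
    print(f"rank {rank}: LHOTSE_DILL_ENABLED =",
          os.environ.get("LHOTSE_DILL_ENABLED"))
    # ... build the DataLoader and train as before ...
```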
