I am trying to use StreamingDataset and StreamingDataLoader, but training almost never restarts successfully after a crash or an interrupted run.

A typical scenario:

1. Start training on multiple GPUs.
2. Cancel training with Ctrl+C. Make sure `nvidia-smi` is clean.
3. Start training again.

It fails with:
```
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank7]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank7]:     shm = SharedMemory(name, True, len(data))
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank7]:     shm = BuiltinSharedMemory(name, create, size)
[rank7]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank7]:     self._fd = _posixshmem.shm_open(
[rank7]: FileExistsError: [Errno 17] File exists: '/000010_locals'
[rank: 7] Child process with PID 37032 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
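For what it's worth, the leaked segments are visible as plain files under `/dev/shm` on Linux, so a stopgap is to delete them by hand before relaunching. A minimal sketch; the `0000*` glob is an assumption based on the `000010_locals` name above, so check what `ls /dev/shm` shows before deleting anything:

```python
# cleanup_shm.py -- run once before relaunching training, not from
# inside the distributed script. Removes leftover POSIX shared-memory
# segments that Streaming names like "000010_locals".
import glob
import os

for path in glob.glob('/dev/shm/0000*'):  # pattern is a guess; verify first
    try:
        os.unlink(path)
        print(f'removed {path}')
    except OSError as exc:  # already gone, or owned by another user
        print(f'could not remove {path}: {exc}')
```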
At this point I cannot get training to restart at all. Rather than deleting segments by hand before every run, I then tried adding

```python
streaming.base.util.clean_stale_shared_memory()
```

at the beginning of my script, but this leads to a different error:
File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 61, in <module>
main()
File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
streaming.base.util.clean_stale_shared_memory()
File "/opt/conda/lib/python3.10/site-packages/streaming/base/util.py", line 176, in clean_stale_shared_memory
destroy_dist = maybe_init_dist()
File "/opt/conda/lib/python3.10/site-packages/streaming/base/distributed.py", line 128, in maybe_init_dist
dist.init_process_group(backend=backend, rank=get_rank(), world_size=get_world_size())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
func_return = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[rank: 1] Child process with PID 47744 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
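My reading of this second traceback is that `clean_stale_shared_memory()` calls `maybe_init_dist()`, which tries to bind a fresh TCPStore on the rendezvous port while the launcher is already using it. One possible workaround, assuming the cleanup can run as a standalone single-process step before the distributed launch (so that, with no `WORLD_SIZE` set, `maybe_init_dist()` has nothing to initialize):

```python
# pre_clean.py -- run standalone (`python pre_clean.py`) before the
# distributed launch, so the cleanup does not race the launcher's TCPStore.
import os

import streaming.base.util

# Defensive: drop any distributed env vars a wrapper may have exported,
# so the cleanup runs as a plain single-process job.
for var in ('RANK', 'WORLD_SIZE', 'LOCAL_RANK', 'MASTER_ADDR', 'MASTER_PORT'):
    os.environ.pop(var, None)

streaming.base.util.clean_stale_shared_memory()
print('stale Streaming shared memory cleaned')
```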
We've been relying heavily on being able to migrate to Mosaic, but we just can't use it productively. Any ideas what could be wrong?
Environment

- OS: Debian 11
- Hardware: H100 on GCP
- Driver Version: 550.90.07
- CUDA Version: 12.4
@AugustDev thanks for bringing up the issue. It's clearly something related to shared memory.

What launcher do you use? Can you provide a small script for us to reproduce?

I also wonder if you have any zombie processes after Ctrl+C. Have you tried `pkill -f python` before restarting?
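For reference, something minimal along these lines would be ideal, i.e. a script that fails on the second launch after a Ctrl+C. This is only a sketch with placeholder paths and sizes, not your actual setup:

```python
# repro.py -- launch with e.g. `torchrun --nproc_per_node=8 repro.py`,
# Ctrl+C mid-epoch, verify nvidia-smi is clean, then launch again.
from streaming import StreamingDataLoader, StreamingDataset

# Placeholder paths: point these at any small MDS dataset.
dataset = StreamingDataset(
    local='/tmp/streaming_cache',  # local shard cache
    remote=None,                   # or an s3:// / gs:// URI
    shuffle=True,
    batch_size=8,
)
loader = StreamingDataLoader(dataset, batch_size=8, num_workers=4)

for epoch in range(100):
    for batch in loader:
        pass  # training step would go here
```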