Dataset does not work after stopping training #781

Open
AugustDev opened this issue Sep 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@AugustDev

I am trying to use StreamingDataset and StreamingDataLoader, but they almost never work after a crash or a stopped training run.

A typical scenario (a minimal sketch of the setup follows the steps below):

  1. Start training on multiple GPUs.
  2. Cancel training with Ctrl+C. Make sure nvidia-smi is clean.
  3. Start training again.
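
For context, a minimal sketch of the setup described above. The remote/local paths, batch size, and launcher details are placeholders, not our actual training script; the script is launched once per GPU (e.g. via Lightning or torchrun).

# Minimal sketch of the multi-GPU setup above; paths and sizes are placeholders.
from streaming import StreamingDataset, StreamingDataLoader

def main():
    dataset = StreamingDataset(
        remote='s3://my-bucket/mds-shards',  # hypothetical remote MDS location
        local='/tmp/mds-cache',              # hypothetical local cache directory
        shuffle=True,
        batch_size=32,
    )
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=8)
    for batch in loader:
        ...  # forward/backward pass; interrupted with Ctrl+C in step 2

if __name__ == '__main__':
    main()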

It fails with

[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank7]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank7]:     shm = SharedMemory(name, True, len(data))
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank7]:     shm = BuiltinSharedMemory(name, create, size)
[rank7]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank7]:     self._fd = _posixshmem.shm_open(
[rank7]: FileExistsError: [Errno 17] File exists: '/000010_locals'
[rank: 7] Child process with PID 37032 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py
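
The last frames show plain Python behavior: SharedMemory(name, create=True) raises FileExistsError when a segment with that name is still registered from the previous run. A standalone illustration, independent of streaming:

# Standalone illustration of the FileExistsError above, using only the stdlib.
from multiprocessing import shared_memory

first = shared_memory.SharedMemory(name='demo_locals', create=True, size=64)
try:
    # Creating the same name again fails, just like the restarted training run.
    shared_memory.SharedMemory(name='demo_locals', create=True, size=64)
except FileExistsError as err:
    print(err)  # [Errno 17] File exists: '/demo_locals'
finally:
    first.close()
    first.unlink()  # the unlink a clean shutdown would normally perform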

At this point I am not able to do anything. I then try to include

streaming.base.util.clean_stale_shared_memory()

at the beginning of my script; however, this leads to a different error:

  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 61, in <module>
    main()
  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
    streaming.base.util.clean_stale_shared_memory()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/util.py", line 176, in clean_stale_shared_memory
    destroy_dist = maybe_init_dist()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/distributed.py", line 128, in maybe_init_dist
    dist.init_process_group(backend=backend, rank=get_rank(), world_size=get_world_size())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[rank: 1] Child process with PID 47744 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
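
The second failure comes from clean_stale_shared_memory() itself: the trace shows it calling maybe_init_dist(), which tries to bind a rendezvous port that is already in use. For completeness, a crude manual fallback sketch (an assumption on my side, not an official streaming API): on Linux the leftover POSIX segments appear under /dev/shm, and the *_locals name from the first traceback can be removed by hand before relaunching, provided no training processes are still running.

# Crude manual cleanup sketch (assumptions: Linux, leftover segments under
# /dev/shm, no training processes still running). Only clears the *_locals
# segments named in the first traceback; streaming may register others.
import glob
import os

for path in glob.glob('/dev/shm/*_locals'):
    try:
        os.remove(path)
        print(f'removed stale segment: {path}')
    except OSError:
        pass  # already removed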

We've been relying heavily on being able to migrate to Mosaic; however, we just can't use it productively. Any ideas what could be wrong?

Environment

  • OS: Debian 11
  • Hardware: H100 GCP
  • Driver Version: 550.90.07
  • CUDA Version: 12.4
@AugustDev added the bug label on Sep 15, 2024
@XiaohanZhangCMU (Member)

@AugustDev thanks for bringing up the issue. It's clearly something related to shared memory.
What launcher do you use?
Can you provide a small script for us to reproduce?
I also wonder if you have any zombie processes after Ctrl+C. Have you tried running pkill -f python before restarting?
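
For the record, a quick pre-flight check along those lines (a sketch; it assumes pgrep is available on the node):

# Quick pre-flight check before restarting: list leftover shared-memory
# segments and any Python processes that survived the Ctrl+C.
import os
import subprocess

print('entries in /dev/shm:', os.listdir('/dev/shm'))
procs = subprocess.run(['pgrep', '-af', 'python'], capture_output=True, text=True)
print('running python processes:')
print(procs.stdout)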
