Dataset does not work after stopping training #781

Open
AugustDev opened this issue Sep 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@AugustDev

I am trying to use StreamingDataset and StreamingDataLoader, but they almost never work after a crash or a stopped training run.

A typical scenario (a minimal sketch of the setup follows the steps below):

  1. Start training on multiple GPUs.
  2. Cancel training with Ctrl+C. Make sure nvidia-smi is clean.
  3. Start training again.
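
For context, a minimal sketch of the setup described above. The remote/local paths, batch size, and launcher details are placeholders, not our actual training script; the script is launched once per GPU (e.g. via Lightning or torchrun).

# Minimal sketch of the multi-GPU setup above; paths and sizes are placeholders.
from streaming import StreamingDataset, StreamingDataLoader

def main():
    dataset = StreamingDataset(
        remote='s3://my-bucket/mds-shards',  # hypothetical remote MDS location
        local='/tmp/mds-cache',              # hypothetical local cache directory
        shuffle=True,
        batch_size=32,
    )
    loader = StreamingDataLoader(dataset, batch_size=32, num_workers=8)
    for batch in loader:
        ...  # forward/backward pass; interrupted with Ctrl+C in step 2

if __name__ == '__main__':
    main()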

It fails with

[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/dataset.py", line 529, in __init__
[rank7]:     self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
[rank7]:     shm = SharedMemory(name, True, len(data))
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
[rank7]:     shm = BuiltinSharedMemory(name, create, size)
[rank7]:   File "/opt/conda/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
[rank7]:     self._fd = _posixshmem.shm_open(
[rank7]: FileExistsError: [Errno 17] File exists: '/000010_locals'
[rank: 7] Child process with PID 37032 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py
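
The last frames show plain Python behavior: SharedMemory(name, create=True) raises FileExistsError when a segment with that name is still registered from the previous run. A standalone illustration, independent of streaming:

# Standalone illustration of the FileExistsError above, using only the stdlib.
from multiprocessing import shared_memory

first = shared_memory.SharedMemory(name='demo_locals', create=True, size=64)
try:
    # Creating the same name again fails, just like the restarted training run.
    shared_memory.SharedMemory(name='demo_locals', create=True, size=64)
except FileExistsError as err:
    print(err)  # [Errno 17] File exists: '/demo_locals'
finally:
    first.close()
    first.unlink()  # the unlink a clean shutdown would normally perform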

At this point I am not able to do anything. I then try to include

streaming.base.util.clean_stale_shared_memory()

at the beginning of my script; however, this leads to a different error:

  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 61, in <module>
    main()
  File "/home/august/cfdx/ai/dna_fm/lightning_gym.py", line 55, in main
    streaming.base.util.clean_stale_shared_memory()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/util.py", line 176, in clean_stale_shared_memory
    destroy_dist = maybe_init_dist()
  File "/opt/conda/lib/python3.10/site-packages/streaming/base/distributed.py", line 128, in maybe_init_dist
    dist.init_process_group(backend=backend, rank=get_rank(), world_size=get_world_size())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
[rank: 1] Child process with PID 47744 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
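
The second failure comes from clean_stale_shared_memory() itself: the trace shows it calling maybe_init_dist(), which tries to bind a rendezvous port that is already in use. For completeness, a crude manual fallback sketch (an assumption on my side, not an official streaming API): on Linux the leftover POSIX segments appear under /dev/shm, and the *_locals name from the first traceback can be removed by hand before relaunching, provided no training processes are still running.

# Crude manual cleanup sketch (assumptions: Linux, leftover segments under
# /dev/shm, no training processes still running). Only clears the *_locals
# segments named in the first traceback; streaming may register others.
import glob
import os

for path in glob.glob('/dev/shm/*_locals'):
    try:
        os.remove(path)
        print(f'removed stale segment: {path}')
    except OSError:
        pass  # already removed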

We've been relying heavily on being able to migrate to Mosaic; however, we just can't use it productively. Any ideas what could be wrong?

Environment

  • OS: Debian 11
  • Hardware: H100 GCP
  • Driver Version: 550.90.07
  • CUDA Version: 12.4
@AugustDev added the bug label on Sep 15, 2024
@XiaohanZhangCMU (Member)

@AugustDev thanks for bringing up the issue. It's clearly something related to shared memory.
What launcher do you use?
Can you provide a small script for us to reproduce?
I also wonder if you have any zombie processes after Ctrl+C. Have you tried running pkill -f python before restarting?
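
For the record, a quick pre-flight check along those lines (a sketch; it assumes pgrep is available on the node):

# Quick pre-flight check before restarting: list leftover shared-memory
# segments and any Python processes that survived the Ctrl+C.
import os
import subprocess

print('entries in /dev/shm:', os.listdir('/dev/shm'))
procs = subprocess.run(['pgrep', '-af', 'python'], capture_output=True, text=True)
print('running python processes:')
print(procs.stdout)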
