Description
Describe the bug
A clear and concise description of what the bug is.
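DeepSpeed calls "hostname -I" during MPI discovery, but "-I" is not a valid flag for the hostname binary on this system (Arch), so the command exits with status 64 and initialization fails. It should call "hostname -i" instead.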
To Reproduce
Steps to reproduce the behavior:
- Go to '...'
- Click on '....'
- Scroll down to '....'
- See error
Expected behavior
A clear and concise description of what you expected to happen.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
If applicable, add screenshots to help explain your problem.
Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00, 9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
  File "/code/git/learnable-activations/mflow.py", line 429, in <module>
    run_experiment(args)
  File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
    model_engine, optimizer = prepare_deepspeed_model(model, args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
    model_engine, _, _, _ = deepspeed.initialize(
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.
System info (please complete the following information):
- OS: Arch
- GPU count and types: 1x 7900 XTX
- Interconnects (if applicable): none (one machine)
- Python version: 3.12
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1
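A possible workaround for a single-process run like the one above, sketched under an assumption I have not verified against the DeepSpeed source: the log line "Not using the DeepSpeed or dist launchers, attempting to detect MPI environment..." suggests that MPI discovery only runs when the usual launcher variables are missing. If that holds, setting them before calling deepspeed.initialize should skip the hostname call entirely. The helper name `set_single_process_env` and the default address and port are hypothetical, chosen only for illustration.

```python
import os


def set_single_process_env(env=os.environ, addr="127.0.0.1", port="29500"):
    """Fill in launcher-style variables for a one-process run.

    Assumption: these five names are the ones DeepSpeed looks for before
    falling back to MPI discovery. Values already present are kept.
    """
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": addr,
        "MASTER_PORT": port,
    }
    for key, value in defaults.items():
        env.setdefault(key, value)  # never overwrite what a launcher set
    return env
```

Calling `set_single_process_env()` at the top of the training script, before DeepSpeed is initialized, would be the intended use. It only makes sense with `-np 1`; with more ranks every process would claim rank 0.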
Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.
The offending code, in mpi_discovery (deepspeed/comm/comm.py):
master_addr = None
if rank == 0:
    hostname_cmd = ["hostname -I"]
    result = subprocess.check_output(hostname_cmd, shell=True)
    master_addr = result.decode('utf-8').split()[0]
master_addr = comm.bcast(master_addr, root=0)
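One way the lookup could be made portable, as a minimal sketch rather than a proposed patch: try "hostname -I" first (it works on distributions whose hostname supports it), fall back to "hostname -i", and finally resolve the host name through the standard library. The function name `detect_master_addr` and its `run` parameter are hypothetical; `run` exists only so the command runner can be swapped out for testing. Note that "hostname -i" and the socket fallback both resolve the host name, so they may return a loopback address if /etc/hosts maps the name that way.

```python
import socket
import subprocess


def detect_master_addr(run=subprocess.check_output):
    """Return the first IP address reported for this host.

    Tries `hostname -I`, then `hostname -i`, then falls back to resolving
    the host name without spawning a subprocess.
    """
    for cmd in (["hostname", "-I"], ["hostname", "-i"]):
        try:
            fields = run(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split()
        except (subprocess.CalledProcessError, OSError):
            continue  # flag not supported or binary missing: try the next one
        if fields:
            return fields[0]
    # Last resort: may raise socket.gaierror if the name does not resolve.
    return socket.gethostbyname(socket.gethostname())
```

In the snippet above, rank 0 would then do `master_addr = detect_master_addr()` before the broadcast. Passing the command as an argument list also removes the need for `shell=True`.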