[BUG] deepspeed tries to call "hostname -I" which is not a valid flag for hostname. it should be "hostname -i" #6497

@sirus20x6

Describe the bug
DeepSpeed tries to call "hostname -I", which is not a valid flag for hostname on this system. It should be "hostname -i".

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.

Processing dataset chunks: 100%|██████████| 106/106 [00:11<00:00,  9.45it/s]
[2024-09-05 04:11:37,288] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.2+c210e601, git-hash=c210e601, git-branch=master
[2024-09-05 04:11:37,288] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-05 04:11:37,288] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
hostname: invalid option -- 'I'
Try 'hostname --help' or 'hostname --usage' for more information.
Traceback (most recent call last):
  File "/code/git/learnable-activations/mflow.py", line 429, in <module>
    run_experiment(args)
  File "/code/git/learnable-activations/mflow.py", line 384, in run_experiment
    model_engine, optimizer = prepare_deepspeed_model(model, args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/git/learnable-activations/mflow.py", line 266, in prepare_deepspeed_model
    model_engine, _, _, _ = deepspeed.initialize(
                            ^^^^^^^^^^^^^^^^^^^^^
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/__init__.py", line 144, in initialize
    dist.init_distributed(dist_backend=dist_backend,
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 673, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/thearray/git/ComfyUI/comfyvenv/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 701, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.

System info (please complete the following information):

  • OS: Arch
  • GPU count and types: 1x 7900 XTX
  • Interconnects (if applicable): one machine
  • Python version: 3.12
  • Any other relevant info about your setup

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

#!/bin/bash
export OMPI_MCA_accelerator=rocm
mpirun -np 1 --mca accelerator rocm python mflow.py --deepspeed_config ds_config.json --log_interval 100 --batch_size 4 --local_rank -1

Docker context
Are you using a specific docker image that you can share?

Additional context

The offending code (in mpi_discovery, deepspeed/comm/comm.py):

master_addr = None
if rank == 0:
    hostname_cmd = ["hostname -I"]
    result = subprocess.check_output(hostname_cmd, shell=True)
    master_addr = result.decode('utf-8').split()[0]
master_addr = comm.bcast(master_addr, root=0)
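
One possible way to make the rank 0 lookup tolerate a hostname that rejects -I is to try several commands in order and fall back to the socket module. The following is only a sketch, not DeepSpeed's actual fix: the helper name first_host_address and its commands parameter are hypothetical, and it assumes that any command which succeeds prints whitespace-separated addresses.

```python
import socket
import subprocess


def first_host_address(commands=(("hostname", "-I"), ("hostname", "-i"))):
    """Return the first address printed by the first command that succeeds.

    `commands` is a sequence of argv sequences (hypothetical parameter, used
    here so the lookup order can be changed or tested).
    """
    for argv in commands:
        try:
            # No shell=True: the flag is passed as its own argument.
            out = subprocess.check_output(list(argv), stderr=subprocess.DEVNULL)
        except (subprocess.CalledProcessError, OSError):
            # Non-zero exit (e.g. "invalid option") or missing binary:
            # move on to the next candidate command.
            continue
        fields = out.decode("utf-8").split()
        if fields:
            return fields[0]
    # Last resort: resolve this machine's host name via the resolver.
    return socket.gethostbyname(socket.gethostname())
```

In the snippet above, this would replace the two lines that build hostname_cmd and call check_output, e.g. master_addr = first_host_address(). Note that "hostname -i" resolves the host name rather than listing interface addresses, so on some setups it may return a loopback address; whether that is acceptable for a single-machine run like this one is a judgment call.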

Labels: bug, training