[BUG] Deepspeed installation issue on ROCm #6599

@itej89

I am facing the following error while building the main branch on ROCm with ops enabled.

Bug:

cd Deepspeed
DS_BUILD_FUSED_ADAM=1 pip install . 
Processing /myworkspace/DeepSpeed
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      [2024-10-04 18:00:35,186] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      [2024-10-04 18:00:35,821] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/myworkspace/DeepSpeed/setup.py", line 200, in <module>
          ext_modules.append(builder.builder())
        File "/myworkspace/DeepSpeed/op_builder/builder.py", line 711, in builder
          compile_args['cxx'].append('-DROCM_WAVEFRONT_SIZE=%s' % self.get_rocm_wavefront_size())
        File "/myworkspace/DeepSpeed/op_builder/builder.py", line 276, in get_rocm_wavefront_size
          result = subprocess.check_output(rocm_wavefront_size_cmd)
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 424, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 505, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 951, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 1837, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: "/opt/rocm/bin/rocminfo | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
      DS_BUILD_OPS=0
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

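This is caused by the removal of "shell=True" in the security fix below. Without a shell, subprocess.check_output treats the whole pipeline string as the name of a single executable, which is why the FileNotFoundError reports the entire command as the missing file.
659f6be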
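
The failure mode can be reproduced without ROCm. This is a minimal sketch, assuming a POSIX system; the echo/grep pipeline is a hypothetical stand-in for the rocminfo command and is not taken from DeepSpeed:

```python
import subprocess

# A shell pipeline written as one string, shaped like the command in the traceback.
# (Stand-in command for illustration only.)
cmd = "echo 'Wavefront Size: 64' | grep -Eo '[0-9]+'"

try:
    # No shell=True: on POSIX the whole string is looked up as one program name.
    subprocess.check_output(cmd)
    outcome = "ran"
except FileNotFoundError as exc:
    outcome = "FileNotFoundError"
    print(exc)
```

With shell=True the same string would be handed to the shell and the pipe would work, which is the behavior the build script relied on before the security fix.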

Resolution:
I propose updating the script to use subprocess.run instead of subprocess.check_output:

    # Module-level imports this method relies on:
    #   import re
    #   import subprocess
    #   from pathlib import Path

    @staticmethod
    def get_rocm_wavefront_size():
        if OpBuilder._rocm_wavefront_size:
            return OpBuilder._rocm_wavefront_size

        rocm_info = Path("/opt/rocm/bin/rocminfo")
        if not rocm_info.is_file():
            rocm_info = Path("rocminfo")

        # Without shell=True, "|" in an argument list is passed to rocminfo as a
        # literal argument, not a pipe. Run rocminfo alone and filter in Python.
        rocm_wavefront_size = "32"  # fallback if rocminfo is missing, fails, or has no match
        try:
            # check=True is needed for CalledProcessError to be raised at all
            result = subprocess.run([str(rocm_info)], capture_output=True, text=True, check=True)
            match = re.search(r"Wavefront Size:\s+([0-9]+)", result.stdout)
            if match:
                rocm_wavefront_size = match.group(1)
        except (subprocess.CalledProcessError, FileNotFoundError):
            pass

        OpBuilder._rocm_wavefront_size = rocm_wavefront_size
        return OpBuilder._rocm_wavefront_size
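
The filtering that the two grep calls perform can also be done in Python once the rocminfo output has been captured, which avoids needing a shell pipe. This is a minimal sketch under stated assumptions: parse_wavefront_size is a hypothetical helper (not part of DeepSpeed), and the sample text is made up to match the grep pattern from the traceback rather than copied from a real rocminfo run:

```python
import re


def parse_wavefront_size(rocminfo_output: str, default: str = "32") -> str:
    """Return the first wavefront size found in the text, or the default.

    Hypothetical helper: mirrors
    grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'
    """
    match = re.search(r"Wavefront Size:\s+([0-9]+)", rocminfo_output)
    return match.group(1) if match else default


# Made-up sample input, shaped to match the pattern above.
sample = "  Name: example-agent\n  Wavefront Size:           64\n"
size_found = parse_wavefront_size(sample)
size_missing = parse_wavefront_size("no matching line here")
print(size_found, size_missing)
```

Keeping the parsing in a pure function like this makes the fallback to "32" easy to check without a ROCm installation.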
