Description
Since PR #6498, we are seeing the following issues on ROCm.
1.
[rank7]: raise child_exception_type(errno_num, err_msg, err_filename)
[rank7]: FileNotFoundError: [Errno 2] No such file or directory: "/opt/rocm/bin/rocminfo | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
[rank0]: Traceback (most recent call last):
Building extension module transformer_inference...
Using envvar MAX_JOBS (32) as the number of workers...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
ninja: error: build.ninja:8: unexpected '='
[rank4]: Traceback (most recent call last):
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
[rank4]: subprocess.run(
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
[rank4]: raise CalledProcessError(retcode, process.args,
[rank4]: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '32']' returned non-zero exit status 1.
Minimal code to reproduce the issue:
import deepspeed
deepspeed.ops.op_builder.InferenceBuilder().load()
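The first error above is the kind of `FileNotFoundError` that Python's `subprocess` raises when an entire shell pipeline is handed over as a single program name without `shell=True`. That reading is inferred from the message only, not from the DeepSpeed source. As a rough sketch of one way to avoid it, the wavefront size could be read by invoking `rocminfo` directly and parsing its output in Python. The helper names `parse_wavefront_size` and `get_wavefront_size` are hypothetical and are not part of DeepSpeed.

```python
import re
import subprocess
from typing import Optional


def parse_wavefront_size(rocminfo_output: str) -> Optional[int]:
    """Return the first 'Wavefront Size' value in rocminfo output, or None.

    Hypothetical helper: mirrors the grep pattern from the error message,
    'Wavefront Size:[[:space:]]+[0-9]+', taking the first match only.
    """
    match = re.search(r"Wavefront Size:\s+(\d+)", rocminfo_output)
    return int(match.group(1)) if match else None


def get_wavefront_size(rocminfo_path: str = "/opt/rocm/bin/rocminfo") -> Optional[int]:
    """Run rocminfo without a shell and parse the wavefront size.

    Hypothetical helper: the command is passed as an argument list, so no
    pipes or quoting are involved. Returns None if rocminfo is missing or
    exits with an error.
    """
    try:
        result = subprocess.run(
            [rocminfo_path], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_wavefront_size(result.stdout)
```

Returning `None` instead of raising lets a caller fall back to a default when `rocminfo` is unavailable. Whether that fallback is appropriate for the op builder is a design decision for the maintainers.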