[BUG] subprocess.CalledProcessError #6585
Comments
ping @tjruwase
shlex.split() does not work as expected for the command strings rocm_wavefront_size_cmd and rocm_gpu_arch_cmd.
Note that safe_cmd is not used in get_rocm_wavefront_size() in the checked-in code.
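To illustrate the shlex.split() point, here is a minimal sketch of how a pipeline string is tokenized. The command is a stand-in (echo replaces /opt/rocm/bin/rocminfo so it runs without ROCm); it is not the exact string from the DeepSpeed source. Assumes a POSIX system.

```python
import shlex
import subprocess

# Stand-in pipeline (assumption): echo replaces rocminfo so this runs anywhere.
cmd = "echo 'Wavefront Size: 64' | grep -Eo '[0-9]+'"

# shlex.split() only tokenizes; the pipe becomes an ordinary argument.
tokens = shlex.split(cmd)
print(tokens)

# Without a shell, echo receives "|", "grep", ... as literal arguments,
# so no pipeline is ever set up and grep never runs.
out = subprocess.check_output(tokens, text=True)
print(out)
```

The token list contains a bare "|" element, and the output is the echoed argument list rather than the filtered number, because pipes and wildcards are only interpreted by a shell.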
@jagadish-amd, thanks! This fix looks good since there is no user input involved in the cmds.
@jagadish-amd, can you please share a PR with this fix?
jagadish-amd added a commit to jagadish-amd/DeepSpeed that referenced this issue on Sep 27, 2024
Fixes microsoft#6585 Use shell=True for subprocess.check_output() in case of ROCm commands. Do not use shlex.split() since command string has wildcard expansion. Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
jagadish-amd added a commit to ROCm/DeepSpeed that referenced this issue on Oct 4, 2024
Fixes microsoft#6585 Use shell=True for subprocess.check_output() in case of ROCm commands. Do not use shlex.split() since command string has wildcard expansion. Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
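A minimal sketch of the approach the commit message describes (shell=True with subprocess.check_output(), no shlex.split()). The helper name run_pipeline and the echo stand-in for rocminfo are assumptions for illustration, not the actual DeepSpeed code. As noted in the thread, this is only reasonable because the command string is fixed and contains no user input.

```python
import subprocess

def run_pipeline(cmd: str) -> str:
    """Hypothetical helper: run a fixed, trusted shell pipeline and return its stripped stdout."""
    # shell=True hands the whole string to /bin/sh, which interprets pipes and wildcards.
    return subprocess.check_output(cmd, shell=True, text=True).strip()

# Stand-in (assumption): echo replaces /opt/rocm/bin/rocminfo; the grep stages
# follow the pipeline quoted in the traceback.
wavefront_cmd = (
    "echo 'Wavefront Size: 64' "
    "| grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' "
    "| grep -Eo '[0-9]+'"
)

size = run_pipeline(wavefront_cmd)
print(size)
```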
Due to PR #6498, we are seeing issues on ROCm.
1.
[rank7]: raise child_exception_type(errno_num, err_msg, err_filename)
[rank7]: FileNotFoundError: [Errno 2] No such file or directory: "/opt/rocm/bin/rocminfo | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
[rank0]: Traceback (most recent call last):
Building extension module transformer_inference...
Using envvar MAX_JOBS (32) as the number of workers...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
ninja: error: build.ninja:8: unexpected '='
[rank4]: Traceback (most recent call last):
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
[rank4]: subprocess.run(
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
[rank4]: raise CalledProcessError(retcode, process.args,
[rank4]: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '32']' returned non-zero exit status 1.
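The FileNotFoundError in the first trace quotes the entire pipeline as the missing file name. On POSIX, that is what subprocess does when a command string is passed as a single program name without shell=True. A small sketch that reproduces the same error shape, using an echo stand-in rather than rocminfo (an assumption so it runs without ROCm):

```python
import subprocess

# Stand-in pipeline (assumption): not the exact DeepSpeed command string.
cmd = "echo 'Wavefront Size: 64' | grep -Eo '[0-9]+'"

error = None
try:
    # No shell=True and no splitting: the whole string is treated as
    # the path of a single executable, which does not exist.
    subprocess.check_output(cmd)
except FileNotFoundError as exc:
    error = exc

print(type(error).__name__, error)
```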
Simple code to repro the issue.
import deepspeed
deepspeed.ops.op_builder.InferenceBuilder().load()
@loadams @jithunnair-amd @pruthvistony @jeffdaily