Enable ROCM in CI #999
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/999
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 1 Pending
As of commit 593fb78 with merge base cedadc7:
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
No ciflow labels are configured for this repo.
@atalman I'm not sure the no-sudo flag does anything. I tried a few variants for the value, like true or "true", and got the same result.
@pytorchbot rebase
docker-image: ${{ matrix.gpu-arch-type == 'rocm' && format('pytorch/manylinux2_28-builder:{0}{1}',
                  matrix.gpu-arch-type,
                  matrix.gpu-arch-version)
                  || 'pytorch/almalinux-builder' }}
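The `cond && a || b` form in the expression above behaves like a ternary: the formatted builder image is chosen for ROCm, and the default image otherwise. A minimal Python sketch of that selection follows. The function name `resolve_docker_image` and the sample version strings are hypothetical, used only for illustration; they are not part of the workflow.

```python
def resolve_docker_image(gpu_arch_type: str, gpu_arch_version: str) -> str:
    """Mirror the workflow's `cond && format(...) || default` expression.

    Hypothetical helper for illustration only; the real selection happens
    inside the GitHub Actions expression, not in Python.
    """
    if gpu_arch_type == "rocm":
        # Equivalent of format('pytorch/manylinux2_28-builder:{0}{1}', type, version)
        return "pytorch/manylinux2_28-builder:{0}{1}".format(
            gpu_arch_type, gpu_arch_version
        )
    # Fallback branch of the `||` operator
    return "pytorch/almalinux-builder"


if __name__ == "__main__":
    # The version strings below are sample values, not taken from the PR.
    print(resolve_docker_image("rocm", "6.2"))
    print(resolve_docker_image("cuda", "12.1"))
```

Note that in GitHub Actions the `&&`/`||` idiom falls through to the default whenever the middle operand is falsy (for example an empty string); here the formatted string is never empty, so the two forms agree.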
pytorch/pytorch#140157
We've migrated to almalinux-builder due to the EOL CENTOS 7.
script: |
  conda create -n venv python=3.9 -y
  conda activate venv
  echo "::group::Install newer objcopy that supports --set-section-alignment"
  yum install -y devtoolset-10-binutils
  export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
The gcctoolset is installed through the dockerfile. Mentioned in this PR: pytorch/pytorch#140157
The same is installed for rocm in this PR: pytorch/pytorch#141609
with:
  timeout: 120
  no-sudo: ${{ matrix.gpu-arch-type == 'rocm' }}
  rocm: ${{ matrix.gpu-arch-type == 'rocm' }}
  continue-on-error: ${{ matrix.gpu-arch-type == 'rocm' }}
This should definitely not be checked in, since it's only there for us to gather a complete list of test failures. @msaroufim Would we merge this PR only after ROCm CI is fully clean? I'd rather get all these infra changes merged so that we run torchao CI on ROCm regularly, and maybe skip any failing tests for ROCm while we work separately to enable them.
It's up to you. The main constraint is that we can't have CI running red per commit or on main, since that just causes confusion and people slowly learn to ignore red. So if you'd like to merge some variant of this PR that doesn't run on commits or on main, we can try to merge it more quickly.
Personally, I'd favor merging the test skips as part of this work; we can then enable tests one by one while keeping CI green.
@petrex Please note that the torchao team would like this PR to be merged with a clean signal for ROCm, so please skip any failing tests as part of this PR.
Done: #1563, based on the latest ROCm CI run.
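For readers following the thread, one way to skip failing tests on ROCm while keeping CI green is sketched below. This is an assumption-laden illustration, not the change made in #1563: the decorator name `skip_if_rocm` and the example test are hypothetical, and ROCm detection relies on `torch.version.hip`, which is `None` on non-ROCm builds of PyTorch.

```python
import unittest

try:
    import torch

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    IS_ROCM = getattr(torch.version, "hip", None) is not None
except ImportError:
    # Without PyTorch installed we treat the platform as non-ROCm.
    IS_ROCM = False


def skip_if_rocm(reason: str):
    """Hypothetical decorator: skip the wrapped test when running on ROCm."""
    return unittest.skipIf(IS_ROCM, reason)


class ExampleTest(unittest.TestCase):
    @skip_if_rocm("not yet enabled on ROCm")
    def test_addition(self):
        # Placeholder body standing in for a test that currently fails on ROCm.
        self.assertEqual(1 + 1, 2)


if __name__ == "__main__":
    unittest.main()
```

Skipped tests are reported as skips rather than failures, so the ROCm job stays green and each test can be re-enabled individually by removing its decorator.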
Needed for pytorch/test-infra#6003 and pytorch/ao#999
Pull Request resolved: #143590
Approved by: https://github.com/atalman
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Happy new year @jithunnair-amd @amdfaa. Is this feature/PR ready to deploy?
2 pending items:
The credential is working now. There is a new failure related to chown on the CI job https://github.com/pytorch/ao/actions/runs/12656214677/job/35334719646, but I think that's a different story.
f75c6a0 to 17289e7
17289e7 to cb1331d
@msaroufim I don't seem to have access to this branch, so I'm using #1563 instead.
Hi @amdfaa, looks like this PR is still failing tests when landed. It's causing other unrelated PRs to fail the same tests: https://hud.pytorch.org/pr/pytorch/ao/1580#35788489448. Please make sure the tests are passing before landing.
This reverts commit d96c6a7.
* Enable ROCM in CI
---------
Co-authored-by: amdfaa <107946068+amdfaa@users.noreply.github.com>
Salient points:
The above PR shows that we've migrated to almalinux-builder due to the EOL CENTOS 7. Changes to regression_test.yml to not install devtoolset-10 have been made in accordance with this switch.
torchao/utils.py
in invocation of torch.cuda.get_device_properties()
Needs changes in pytorch/test-infra#6104