Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable ROCM in CI #999

Merged
merged 4 commits into from
Jan 17, 2025
Merged

Enable ROCM in CI #999

merged 4 commits into from
Jan 17, 2025

Conversation

msaroufim
Copy link
Member

@msaroufim msaroufim commented Oct 3, 2024

Salient points:

  • Add ROCm nightly wheels to test matrix
  • Refactor conda-builder -> almalinux-builder pytorch#140157
    The above PR shows that we've migrated to almalinux-builder due to the EOL CENTOS 7. Changes to regression_test.yml to not install devtoolset-10 have been made in accordance with this switch.
  • Fix bug in torchao/utils.py in invocation of torch.cuda.get_device_properties()

Needs changes in pytorch/test-infra#6104

Copy link

pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/999

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending

As of commit 593fb78 with merge base cedadc7 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 3, 2024
Copy link

pytorch-bot bot commented Oct 3, 2024

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@msaroufim
Copy link
Member Author

@atalman im not sure the no-sudo flag does anything. Tried a few variants for the value like true or "true" and same result

@amdfaa
Copy link
Collaborator

amdfaa commented Nov 19, 2024

@pytorchbot rebase

1 similar comment
@amdfaa
Copy link
Collaborator

amdfaa commented Nov 19, 2024

@pytorchbot rebase

docker-image: ${{ matrix.gpu-arch-type == 'rocm' && format('pytorch/manylinux2_28-builder:{0}{1}',
matrix.gpu-arch-type,
matrix.gpu-arch-version)
|| 'pytorch/almalinux-builder' }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytorch/pytorch#140157
We've migrated to almalinux-builder due to the EOL CENTOS 7.

script: |
conda create -n venv python=3.9 -y
conda activate venv
echo "::group::Install newer objcopy that supports --set-section-alignment"
yum install -y devtoolset-10-binutils
export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gcctoolset is installed through the dockerfile. Mentioned in this PR: pytorch/pytorch#140157

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is installed for rocm in this PR: pytorch/pytorch#141609

@msaroufim msaroufim added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Dec 5, 2024
.github/workflows/regression_test.yml Outdated Show resolved Hide resolved
with:
timeout: 120
no-sudo: ${{ matrix.gpu-arch-type == 'rocm' }}
rocm: ${{ matrix.gpu-arch-type == 'rocm' }}
continue-on-error: ${{ matrix.gpu-arch-type == 'rocm' }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely not be checked-in, since it's only for us to gather a complete list of test failures. @msaroufim Would we merge this PR only after ROCm CI is fully clean? I'd rather get all these infra changes merged, so that we run torchao CI on ROCm regularly, and maybe skip any failing tests for ROCm while we work separately to enable them.

Copy link
Member Author

@msaroufim msaroufim Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you, the main constraint is we can't really be having CI per commit or on main run red since then it just causes confusion and people slowly learn to ignore seeing red. So if you'd like to merge some variant of this PR without running on commits to or on main then we can try to merge this more quickly

Personally I'd favor merging the skip tests as part of this work and we can do enablement for tests one by one easily while maintaining a green CI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petrex Please note that torchao team would like to have this PR be merged with a clean signal for ROCm, so please skip any failing tests as part of this PR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done . #1563 based on the latest ROCm CI run

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Dec 20, 2024
Needed for pytorch/test-infra#6003 and pytorch/ao#999

Pull Request resolved: #143590
Approved by: https://github.com/atalman

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
@petrex
Copy link
Collaborator

petrex commented Jan 6, 2025

happy new year @jithunnair-amd @amdfaa Is this feature/PR ready to deploy?

@jithunnair-amd
Copy link
Collaborator

happy new year @jithunnair-amd @amdfaa Is this feature/PR ready to deploy?

2 pending items:

  1. Resolve issue with aws credentials causing some steps to fail in CI job eg. https://github.com/pytorch/ao/actions/runs/12656214677/job/35268262299#step:17:39 : this one needs pytorch infra team to enable "AWS trust policy" for torchao repository cc @huydhn to submit a PR for that
  2. Skip any failing unit tests for ROCm as part of this PR as per Enable ROCM in CI #999 (comment) cc @petrex to add skips to this PR

@huydhn
Copy link
Contributor

huydhn commented Jan 8, 2025

1. Resolve issue with aws credentials causing some steps to fail in CI job eg. https://github.com/pytorch/ao/actions/runs/12656214677/job/35268262299#step:17:39 : this one needs pytorch infra team to enable "AWS trust policy" for torchao repository cc @huydhn to submit a PR for that

The credential is working now. There is a new failure w.r.t chown on the CI job https://github.com/pytorch/ao/actions/runs/12656214677/job/35334719646, but it’s a different story I think

@amdfaa amdfaa force-pushed the msaroufim-patch-21 branch from f75c6a0 to 17289e7 Compare January 10, 2025 23:44
@petrex
Copy link
Collaborator

petrex commented Jan 14, 2025

@msaroufim I dont seem to have access to this branch so #1563 instead.
@amdfaa please kick off a new ROCm CI run with #1563
thanks!

@petrex
Copy link
Collaborator

petrex commented Jan 14, 2025

happy new year @jithunnair-amd @amdfaa Is this feature/PR ready to deploy?

2 pending items:

  1. Resolve issue with aws credentials causing some steps to fail in CI job eg. https://github.com/pytorch/ao/actions/runs/12656214677/job/35268262299#step:17:39 : this one needs pytorch infra team to enable "AWS trust policy" for torchao repository cc @huydhn to submit a PR for that
  2. Skip any failing unit tests for ROCm as part of this PR as per Enable ROCM in CI #999 (comment) cc @petrex to add skips to this PR

#1563

amdfaa added a commit to pytorch/test-infra that referenced this pull request Jan 16, 2025
@amdfaa amdfaa merged commit d96c6a7 into main Jan 17, 2025
15 of 18 checks passed
@andrewor14
Copy link
Contributor

Hi @amdfaa, looks like this PR is still failing tests when landed. It's causing other unrelated PRs to fail the same tests: https://hud.pytorch.org/pr/pytorch/ao/1580#35788489448. Please make sure the tests are passing before landing.

andrewor14 added a commit that referenced this pull request Jan 17, 2025
andrewor14 added a commit that referenced this pull request Jan 17, 2025
Revert "Enable ROCM in CI (#999)"

This reverts commit d96c6a7.
petrex pushed a commit to petrex/ao that referenced this pull request Jan 22, 2025
* Enable ROCM in CI

---------

Co-authored-by: amdfaa <107946068+amdfaa@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/rocm CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: rocm topic: not user facing Use this tag if you don't want this PR to show up in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants