
[AMD][BACKEND] Switch to code object v5 #5005

Draft · wants to merge 7 commits into main from switchToCOV5
Conversation

@AlexAUT (Contributor) commented on Oct 29, 2024

Switches to code object v5, which requires bumping ROCm to 6.2+ to avoid segfaults for device_prints and tl.num_programs.
The added unit test covers the previous segfault with device_prints for 2D tensors.

This is a preparation to test llvm/llvm-project@c4d8920
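
For context, a minimal sketch (not the actual test added in this PR) of the kind of kernel the new unit test exercises: a device_print of a 2D tile plus tl.num_programs, the two operations the description above says segfaulted on ROCm older than 6.2 once code object v5 is enabled. The kernel name, block size, and launch grid below are illustrative assumptions.

```python
# Illustrative sketch only: a kernel in the spirit of the new unit test
# described above (device_print on a 2D tensor + tl.num_programs).
# Names and parameters are assumptions, not taken from this PR.
import torch
import triton
import triton.language as tl


@triton.jit
def print_2d_tile_kernel(x_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # Build a BLOCK x BLOCK tile of pointers and load it.
    tile = tl.load(x_ptr + offs[:, None] * BLOCK + offs[None, :])
    # Per the PR description, both of these segfaulted with COv5 on ROCm < 6.2.
    tl.device_print("tile", tile)
    tl.device_print("num_programs", tl.num_programs(0))


BLOCK = 16
# PyTorch on ROCm exposes AMD GPUs through the "cuda" device string.
x = torch.arange(BLOCK * BLOCK, device="cuda", dtype=torch.float32)
print_2d_tile_kernel[(1,)](x, BLOCK=BLOCK)
torch.cuda.synchronize()  # flush device-side prints
```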

  • I am not making a trivial change, such as fixing a typo in a comment.

• I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
• The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

@@ -311,7 +311,7 @@ jobs:
runner: ${{fromJson(needs.Runner-Preparation.outputs.matrix-HIP)}}
name: Integration-Tests (${{matrix.runner[1] == 'gfx90a' && 'mi210' || 'mi300x'}})
container:
-      image: rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.4
+      image: rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_release_2.3.0
AlexAUT (Author) commented:
I couldn't find ROCm 6.2 images with PyTorch 2.4. Maybe someone could point me to them.

Collaborator commented:
Why do you need PyTorch 2.4?
Previously we needed PyTorch 2.3 (or 2.2) for fp8 support.

Collaborator commented:
I recall some interpreter tests failing with an older PyTorch version (which didn't support some types), but I can't recall the exact details.

@AlexAUT force-pushed the switchToCOV5 branch 4 times, most recently from 8cf7619 to d5b5d81, on November 6, 2024 at 12:53.
@@ -234,5 +234,5 @@ Supported Platforms:
Supported Hardware:

- NVIDIA GPUs (Compute Capability 8.0+)
-- AMD GPUs (ROCm 5.2+)
+- AMD GPUs (ROCm 6.2+)
Collaborator commented:
So ROCm 5.x is no longer supported?

AlexAUT (Author) replied:

Due to the switch to COv5, test_optimize_thread_locality and some device_prints (covered by the new test) segfault on ROCm 6.1 and 6.0. I haven't tested 5.x versions.

Collaborator commented:

I think we need to chat a bit regarding when to land this then, given it's a breaking change.

@antiagainst (Collaborator) left a comment:

Thanks for figuring out the Docker issue! We need to split this pull request into two parts: one updating the Docker image, and a follow-up switching to COv5.

@AlexAUT mentioned this pull request on Nov 22, 2024.