
ROCm Sparse Marlin Kernels #1206


Closed · wants to merge 35 commits

Conversation

@petrex (Collaborator) commented Oct 31, 2024

Built on top of #1201. This pull request introduces support for ROCm (Radeon Open Compute) for the sparse Marlin kernel in addition to CUDA, enabling the code to run on AMD GPUs.

The main changes involve conditional compilation to handle differences between CUDA and ROCm, as well as adding ROCm-specific intrinsics for MI300x.

Co-author: @lcskrishna


Key changes include:

ROCm Support in setup.py:

  • HIP kernel generation

Conditional Compilation in CUDA Source Files:

  • Added conditional compilation directives to exclude certain code for ROCm and include ROCm-specific implementations.

ROCm-specific Implementations:

  • Implemented ROCm-specific versions of functions and macros that are different from their CUDA counterparts, ensuring compatibility and performance on AMD GPUs.

Next:

  • Validation and benchmarking across workloads on MIxxx GPUs


pytorch-bot bot commented Oct 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1206

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 2 Unrelated Failures

As of commit 8124a58 with merge base 883dc65:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 31, 2024
@msaroufim msaroufim requested review from msaroufim and removed request for msaroufim November 2, 2024 22:51
@msaroufim (Member):

Do you have performance numbers by any chance relative to fp16? wanna make sure the performance improvements are competitive with CUDA

@petrex (Collaborator, Author) commented Nov 5, 2024

> Do you have performance numbers by any chance relative to fp16? wanna make sure the performance improvements are competitive with CUDA

Still WIP, but would you share the benchmark you guys are using? Will try that on MI300X when the PR is ready.

@msaroufim (Member):

Ok holler at me again whenever you need a review. Really excited to see this land

@drisspg (Contributor) commented Nov 5, 2024

For benchmarking it is a little ad hoc; the best place for this today would be to verify on: https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py

@jcaip jcaip mentioned this pull request Nov 11, 2024

pytorch-bot bot commented Jan 7, 2025

Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/benchmark

@petrex petrex requested a review from msaroufim January 8, 2025 15:59
@msaroufim (Member) left a comment:

seems good to me, I'll lean on @atalman and @jcaip for the final merge since the error you're seeing in CI does seem like an underlying infra issue. It's not a flake though, I tried rerunning it and it still fails

@petrex petrex requested review from jcaip and atalman January 8, 2025 20:18
@petrex petrex force-pushed the rocm_sparse_marlin branch from 662bfe7 to a4e8c30 Compare January 8, 2025 23:06
@jcaip (Contributor) left a comment:

Looks good! Did you get a chance to try this and get benchmarking numbers? Curious to see how it compares. We should probably update the testing framework too for AMD

@@ -19,6 +19,28 @@
#include "base.h"

namespace torchao {

#ifdef USE_ROCM
Contributor:

Should we gate on a specific ROCm version like we do for CUDA?

@petrex (Collaborator, Author) Jan 9, 2025:

Good point! What we need is a GPU arch check instead of a ROCm version check. I have added a GPU architecture check in setup.py. As a result, the kernel will now only be built for the MI300X architecture.

Contributor:

Sounds good, I think setup.py was recently updated by #1490, so you may have to pull in the new changes.

#if defined(USE_ROCM)
#if ROCM_VERSION >= 60200
auto BF16_BIAS = __bfloat162bfloat162(__hip_bfloat16(__hip_bfloat16_raw{0xC308}));
auto BF16_ONE = __bfloat162bfloat162(__hip_bfloat16(__hip_bfloat16_raw{0x3F80}));
Contributor:

what does BF16_ONE refer to here?

@petrex (Collaborator, Author):

Thanks, let me clean up a little bit. I'd like this PR to focus on sparse_marlin; tensor_core_tile_layout.cu should go to #1201 instead.

@petrex (Collaborator, Author) Jan 9, 2025:

0x3F80 in BF16: sign bit (0) + exponent (01111111) + mantissa (0000000) = 1.0
Just renamed it in #1201 to reflect this.
see : 26fa19c

@petrex (Collaborator, Author) commented Jan 9, 2025

> Looks good! Did you get a chance to try this and get benchmarking numbers? Curious to see how it compares. We should probably update the testing framework too for AMD

Thanks, it is planned. I will update the benchmark PR.

@petrex petrex force-pushed the rocm_sparse_marlin branch from 1f3b773 to 08d1cfb Compare January 9, 2025 22:34
@petrex petrex requested a review from jcaip January 10, 2025 00:25
@jcaip (Contributor) left a comment:

LGTM, should be good to merge once we fix the setup.py conflicts.

petrex and others added 2 commits January 15, 2025 15:03
@petrex petrex force-pushed the rocm_sparse_marlin branch from 3185c9d to aea9d81 Compare January 15, 2025 23:52
@petrex (Collaborator, Author) commented Jan 16, 2025

> LGTM, should be good to merge once we fix the setup.py conflicts.

done.

@jcaip jcaip mentioned this pull request Feb 28, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm label Mar 4, 2025
@jcaip jcaip mentioned this pull request Mar 4, 2025
jcaip added a commit that referenced this pull request Mar 5, 2025
* enable build for rocm for fp6_llm

* enable tiled layout extension

* fix build error related to option

* require rocm 6.2

* enable tensor tiled layout extension with successful compilation

* clean-up

* fix potential memory access issue

* fix __nv_bfloat162 init

* add comment for MI300x isa

* fix build for non-rocm

* add sparse_marlin kernel to the build

* drop .h from conversion

* cp_asyc4_pred_zfill() AMD implementation

* implement matching mem utility with amd GCN isa

* implement mma util with amd gcn isa

* enable rocm path

* update copy from global to lds

* implement cvta_to_shared()

* consolidate code with cvta_to_shared()

* lint

* add GPU arch check for MI300x

* revert change in tensor_core_tile_layout.cu

* lint

refactor for better readability

* fix setup

---------

Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
Co-authored-by: Peter Yeh <petrex@users.noreply.github.com>
Co-authored-by: Peter Y. Yeh <pyeh@amd.com>
@metascroy metascroy mentioned this pull request Mar 5, 2025
@jcaip jcaip closed this Mar 5, 2025
Labels: CLA Signed · module: rocm · topic: new feature