Accelerate Inferencing on AMD RDNA3 GPUs with Composable Kernel library
Hello, and welcome to the AMD RDNA3 GPU high-performance inferencing blog post. In this post, we discuss how to use the Composable Kernel library to accelerate inferencing on AMD RDNA3 GPUs. We will cover the following topics:
Table of Contents
1: RDNA3 Architecture for AI
RDNA3 is AMD's latest architecture, designed for graphics as well as AI workloads, and the first in our RDNA line of products to include dedicated AI accelerator units. The blog post How to accelerate AI applications on RDNA 3 using WMMA explains how to leverage Wave Matrix Multiply Accumulate (WMMA) instructions to accelerate AI workloads and maximize hardware efficiency with minimal effort. We also recommend reading the "RDNA3" Instruction Set Architecture Reference Guide for a deeper understanding of the RDNA3 architecture.
2: Model framework
AITemplate is an open-source, high-performance inference framework for optimizing machine learning models, developed by Meta. You can find more information about AITemplate in the Meta AI tech blog post Faster, more flexible inference on GPUs using AITemplate, a revolutionary new inference engine. We built a good working relationship with Meta while bringing up MI200 GPU support in AITemplate, and were pleasantly surprised by the performance it can offer. That successful experience gave us the confidence to adopt AITemplate as the model optimization tool in our end-to-end inferencing solution for RDNA3 GPUs.
3: Kernel library
An upcoming release of Composable Kernel (CK) will include support for AMD RDNA3 GPUs. CK is a library of highly optimized machine learning kernels, intended as a building block for machine learning frameworks. It is written in C++ and designed to be portable across different hardware architectures and operating systems, as well as easy to use and to integrate into existing applications.
Regarding kernel optimization, we:
- Follow the data-parallel blocking algorithm widely used in modern BLAS libraries. The blocking levels stay consistent with the CK conceptual abstraction shown above as well as with the hardware architecture.
- Adopt the Implicit GEMM method to transform convolution problems into GEMM problems, so that we can reuse the building blocks from thread-wise up to grid-wise level; the only difference is the data layout of the input and output, which CK's Tensor Transform feature lets us implement easily at the Device Operation level.
- Implement Flash Attention, proposed by Tri Dao in his recent paper, in the attention layers to reduce memory pressure.
4: Stable-Diffusion Performance benchmark
We tested in the following hardware environment:
on a workstation with a Ryzen Threadripper 3960X CPU.
We measured a latency of ~1.8 s/image with the following input:
5: Stable-Diffusion demo
For ease of use, we provide an out-of-the-box docker image with a web UI for Stable Diffusion, which can be used to experience high-performance inferencing on RX7900XTX/XT and W7900 GPUs with a few clicks.
5.1: Prerequisite
5.2: Build from source
Execute the following step-by-step commands in the docker image rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1, assuming your working directory is ~/workspace.
5.2.1 Build the specific HIP compiler
5.2.2 Build AIT and Stable Diffusion demo
Dry-run the Stable Diffusion demo with benchmarking
Run with the simple web UI
5.3: Out-of-box docker environment
To ensure an identical environment and save users the time of building and compiling, we provide an out-of-the-box docker image for this work at:
After launching the docker container, you can reproduce the demo without any hassle.
Conclusion
In this blog, we introduced an end-to-end high-performance AI inference solution for AMD RDNA3 GPUs, which includes a set of optimized kernels for Stable Diffusion. We also provided a step-by-step build guide to help users experience high-performance inferencing on RX7900XTX/XT GPUs. As an open-source project, we welcome feedback and contributions from the community.
About the Authors