Accelerate Inferencing on AMD RDNA3 GPUs with Composable Kernel library
Hello, and welcome to the AMD RDNA3 GPU high-performance inferencing blog post. In this post, we discuss how to use the Composable Kernel library to accelerate inferencing on AMD RDNA3 GPUs. We will cover the following topics:
Table of Contents
1: RDNA3 Architecture for AI
RDNA3 is AMD's latest architecture, designed for graphics as well as AI workloads, and the first in our RDNA line of products to include dedicated AI accelerator units. The blog post How to accelerate AI applications on RDNA 3 using WMMA explains how to leverage Wave Matrix Multiply Accumulate (WMMA) instructions to accelerate AI workloads and maximize hardware efficiency with minimal effort. We also recommend reading the "RDNA3" Instruction Set Architecture Reference Guide for a deeper understanding of the RDNA3 architecture.
2: Model framework
AITemplate is an open-source, high-performance inference framework for optimizing machine learning models, developed by Meta. You can find more information about AITemplate in the Meta AI tech blog post Faster, more flexible inference on GPUs using AITemplate, a revolutionary new inference engine. We built a good working relationship with Meta while bringing up MI200 GPU support in AITemplate, and were pleasantly surprised by the performance it can offer. That successful experience gave us the confidence to adopt AITemplate as the model optimization tool in our end-to-end inferencing solution for RDNA3 GPUs.
3: Kernel library
An upcoming release of Composable Kernel (CK) will include support for AMD RDNA3 GPUs. CK is a library of highly optimized machine learning kernels, intended as a building block for machine learning frameworks. It is written in C++ and designed to be portable across different hardware architectures and operating systems, as well as easy to use and to integrate into existing applications.
Regarding kernel optimization, we:
- Follow the data-parallel blocking algorithm widely used in modern BLAS libraries. The blocking levels stay consistent with the CK conceptual abstraction shown above as well as with the hardware architecture.
- Adopt the Implicit GEMM method to transform convolution problems into GEMM problems, so that we can reuse the building blocks from thread-wise up to grid-wise level; the only difference is the data layout of the input and output, which CK's Tensor Transform feature lets us implement easily at the Device Operation level.
- Implement Flash Attention, proposed by Tri Dao in his recent paper, in the attention layers to reduce memory pressure.
4: Stable-Diffusion Performance benchmark
We tested in the following hardware environment:
on a workstation with a Ryzen Threadripper 3960X CPU.
We measured a latency of ~1.8 s/image with the following input:
5: Stable-Diffusion demo
For ease of use, we provide an out-of-the-box docker image with a web UI for Stable Diffusion, which can be used to experience high-performance inferencing on RX7900XTX/XT and W7900 GPUs with a few clicks.
5.1: Prerequisite
5.2: Build from source
Execute the following step-by-step commands in the docker image rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1, assuming your working directory is ~/workspace.
5.2.1 Build the specific HIP compiler
5.2.2 Build AIT and Stable Diffusion demo
Dry-run the Stable Diffusion demo with benchmarking
Run with the simple web UI
5.3: Out-of-box docker environment
To ensure an identical environment and save users the time of building and compiling, we provide an out-of-the-box docker image for this work at:
After launching the docker container, you can reproduce the demo without any hassle.
Conclusion
In this blog, we introduced an end-to-end high-performance AI inference solution for AMD RDNA3 GPUs, which includes a set of optimized kernels for Stable Diffusion. We also provided a step-by-step build guide to help users experience high-performance inferencing on RX7900XTX/XT GPUs. As an open-source project, we welcome feedback and contributions from the community.
About the Authors