Add support for new gfx1200 and gfx1201 targets #12372
base: master
Conversation
CC: @powderluv
@JohannesGaessler Could you please update the labels? I don't have the correct permissions for that: GraphQL: slojosic-amd does not have the correct permissions to execute
```diff
@@ -189,7 +189,7 @@ The following compilation options are also available to tweak performance:

 | Option | Legal values | Default | Description |
 |--------|--------------|---------|-------------|
-| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
+| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, RDNA3, RDNA4). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
```
CDNA too, maybe condense as V100, CDNA and RDNA3+
```diff
 #define GGML_CUDA_CC_RDNA1 (GGML_CUDA_CC_OFFSET_AMD + 0x1010) // RX 5000
 #define GGML_CUDA_CC_RDNA2 (GGML_CUDA_CC_OFFSET_AMD + 0x1030) // RX 6000, minimum for dp4a
 #define GGML_CUDA_CC_RDNA3 (GGML_CUDA_CC_OFFSET_AMD + 0x1100) // RX 7000, minimum for WMMA
+#define GGML_CUDA_CC_RDNA4 (GGML_CUDA_CC_OFFSET_AMD + 0x1200) // RX 9000
```
If you want to add RDNA4, you also need to change GGML_CUDA_CC_IS_RDNA3 so that it does not match RDNA4.
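A minimal sketch of what that change could look like, assuming the existing GGML_CUDA_CC_IS_RDNA3 macro is a plain lower-bound check on the AMD compute capability (the exact definitions in the tree may differ):

```cpp
// Sketch only: bound the RDNA3 check from above so it no longer matches RDNA4.
// Assumes the previous definition was simply "(cc) >= GGML_CUDA_CC_RDNA3".
#define GGML_CUDA_CC_IS_RDNA3(cc) ((cc) >= GGML_CUDA_CC_RDNA3 && (cc) < GGML_CUDA_CC_RDNA4)
#define GGML_CUDA_CC_IS_RDNA4(cc) ((cc) >= GGML_CUDA_CC_RDNA4)
```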
```diff
 cu_compute_type = CUBLAS_COMPUTE_32F;
 alpha = &alpha_f32;
 beta = &beta_f32;

+if (GGML_CUDA_CC_IS_RDNA4(compute_capability)) {
```
So you test for RDNA4 inside a branch that only runs for CDNA; that makes no sense.
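For illustration, a schematic of the problem the reviewer is pointing at; this is not the exact PR code, and the surrounding control flow is assumed from the comment:

```cpp
// Schematic only: an RDNA4 test nested inside a CDNA-only branch can never be
// true, so the RDNA4 path below is dead code on RDNA4 hardware.
if (GGML_CUDA_CC_IS_CDNA(compute_capability)) {
    cu_compute_type = CUBLAS_COMPUTE_32F;
    alpha = &alpha_f32;
    beta = &beta_f32;
    if (GGML_CUDA_CC_IS_RDNA4(compute_capability)) {  // unreachable here
        // ...
    }
}
// One fix is to test both architectures in a single outer condition, as the
// ggml_cuda_op_mul_mat_cublas hunk below does.
```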
```diff
@@ -1214,7 +1214,7 @@ static void ggml_cuda_op_mul_mat_cublas(

 CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));

-if (GGML_CUDA_CC_IS_CDNA(compute_capability)) {
+if (GGML_CUDA_CC_IS_CDNA(compute_capability) || GGML_CUDA_CC_IS_RDNA4(compute_capability)) {
```
If V_WMMA_F32_16X16X16_F16 does better here than V_WMMA_F16_16X16X16_F16 on RDNA4, it stands to reason that it does on RDNA3 too.
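If that assumption holds, the same compute-type selection could be widened to RDNA3 as well; this is only a sketch of the reviewer's suggestion, not something backed by benchmarks in this PR:

```cpp
// Hypothetical extension of the condition above: also route RDNA3 through the
// FP32 compute type, on the unverified assumption that FP32 WMMA accumulation
// is faster there too.
if (GGML_CUDA_CC_IS_CDNA(compute_capability) ||
    GGML_CUDA_CC_IS_RDNA3(compute_capability) ||
    GGML_CUDA_CC_IS_RDNA4(compute_capability)) {
    cu_compute_type = CUBLAS_COMPUTE_32F;
}
```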