
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 #14624


Open · wants to merge 18 commits into master

Conversation

@deepsek commented Jul 10, 2025

  • Added matrix core support (MFMA instructions) for MMQ kernels.

  • Enabled stream-K for CDNA3 to work with MMQ kernels.

  • Removed use of the hardcoded WARP_SIZE constant in MMQ kernels (see the sketch below).

  • NOTE: Thoughts on removing all uses of hardcoded constants specific to NVIDIA (like WARP_SIZE) in order to support other GPUs?
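
As a concrete illustration of the WARP_SIZE point, here is a minimal sketch of an architecture-dependent warp size. The helper name and the CDNA/GCN defines are assumptions for illustration, not the PR's actual code:

```cpp
// Hypothetical sketch: pick the warp/wavefront width at compile time instead
// of hardcoding WARP_SIZE = 32. CDNA/GCN wavefronts are 64 lanes wide, while
// NVIDIA warps (and RDNA in wave32 mode) are 32 lanes wide.
static constexpr int ggml_cuda_warp_size() {
#if defined(GGML_USE_HIP) && (defined(CDNA) || defined(GCN))
    return 64;
#else
    return 32;
#endif
}
```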

@JohannesGaessler @ggerganov
P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss some future PR plans for additional backends, flash attention changes, etc.

EDIT: Added some performance charts for the DeepSeek-V3 model.

Upstream vs ROCm Fork Development
[performance chart]

MI300X vs H100 Throughput Test
[performance chart]

github-actions bot added the Nvidia GPU and ggml labels Jul 10, 2025
@JohannesGaessler (Collaborator)

I would be happy to get on a call with you to discuss AMD hardware support; my email address can be found on my GitHub page.

@ggerganov (Member)

> P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss some future PR plans for additional backends, flash attention changes, etc.

@deepsek Thanks for the contribution and for reaching out. On topics related to the CUDA backend, @JohannesGaessler is the best person to consult with. For additional backends, @slaren can provide guidelines and advice. I'll be happy to provide input on any matters as well.

I am also available for a call - feel free to contact me.

@Dampfinchen commented Jul 11, 2025

Very nice to see this initiative. I assume improvements made for CDNA will also carry over to the consumer side next year when UDNA releases, so this is exciting news for the future of AMD products!

@IMbackK (Collaborator) commented Jul 12, 2025

This certainly is good news.

@JohannesGaessler (Collaborator)

Sorry, I wanted to ask: @IMbackK since you've been working on AMD support, are you interested in joining the discussion?

@IMbackK (Collaborator) commented Jul 14, 2025

> Sorry, I wanted to ask: @IMbackK since you've been working on AMD support, are you interested in joining the discussion?

Yes, certainly. It would help to avoid duplication of effort. I can be reached via email at uvos.xyz, user carl.

deepsek requested a review from ngxson as a code owner Jul 15, 2025
github-actions bot added the devops label Jul 15, 2025
@deepsek (Author) commented Jul 21, 2025

Hi @JohannesGaessler, is there any blocker for merging this PR into the main branch?

@IMbackK (Collaborator) commented Jul 21, 2025

@deepsek There are a few small things, as discussed: a better name for this MFMA path, so that an RDNA WMMA solution can be added later without the naming becoming strange, and the use of two V_MFMA_I32_16X16X16I8 instructions on gfx908 and gfx90a, even if this path is not chosen for those devices, to ease maintainability.

I would also like to try this myself on gfx94x somehow, and I am not sure what the state is with regard to access to AMD's cloud for maintenance of a gfx94x-specific code path; maybe @ggerganov can also comment on that. A problem here is that after CDNA2/gfx90a/MI210, AMD has not made any further CDNA devices in a PCIe add-in board form factor, so outside of acquiring an entire MI300 OAM machine, no one can simply add a CDNA3/gfx94x/MI3xx-compatible card to their system.

IMbackK self-assigned this Jul 21, 2025
@deepsek (Author) commented Jul 21, 2025

@IMbackK,

  • Regarding the naming: yes, of course. Any suggestions on what you think would be appropriate? I see FP16_MMA_AVAILABLE being used. Thoughts on just using INT8_MMA_AVAILABLE?

  • For this PR, which specifically targets CDNA3, I don't think V_MFMA_I32_16X16X16I8 for CDNA2 and CDNA1 should be a blocker for CDNA3-specific workloads, since the code paths are currently different. We can open a new PR for that; even though we can add the two instructions, it would need testing, etc.

  • Since the performance for CDNA3 is significantly higher than the upstream version (see the charts in the main PR comment), we would like to see this change available to the public ASAP. We can incorporate other architectures in other PRs.

  • Regarding public access to MI300X instances: anyone should be able to spin up an instance on the AMD Developer Cloud.
    https://www.amd.com/en/developer/resources/cloud-access/amd-developer-cloud.html

@ggerganov (Member)

I tested this PR on my RTX 2060 and don't observe any regressions.

@deepsek If you run ./bin/test-backend-ops -o MUL_MAT, do all tests pass on the target hardware?

@JohannesGaessler (Collaborator) left a review comment

This is only a partial review because I'm noticing that some calculations on master are wrong in terms of the logic (but correct in terms of the values). My suggestion for this PR is that I fix the incorrect logic on master first and that we then rebase this PR.

My code in mma.cuh implicitly assumes that for int8 and FP16 the raw data layout across threads in a warp is the same. Please tell me if this assumption is wrong for AMD.

@JohannesGaessler (Collaborator)

> any suggestions on what you think would be appropriate? I see FP16_MMA_AVAILABLE being used. Thoughts on just using INT8_MMA_AVAILABLE?

If the instructions are only available on CDNA3, my suggestion is CDNA3_MMA_AVAILABLE.

@IMbackK (Collaborator) commented Jul 21, 2025

> > any suggestions on what you think would be appropriate? I see FP16_MMA_AVAILABLE being used. Thoughts on just using INT8_MMA_AVAILABLE?
>
> If the instructions are only available on CDNA3, my suggestion is CDNA3_MMA_AVAILABLE.

The MFMA instruction used here, V_MFMA_I32_16X16X32I8, is only available on CDNA3; the other CDNA devices support only k=16, i.e. V_MFMA_I32_16X16X16I8. Obviously it is pretty trivial to use V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8, whichever is available; the instructions are otherwise the same.

The other MFMA instruction used is the same story, with CDNA3 having V_MFMA_I32_32X32X16I8 and CDNA1/2 having only V_MFMA_I32_32X32X8I8.

(Consumer) RDNA3+ GPUs support the same-purpose V_WMMA_I32_16X16X16_IU8, but the semantics and performance characteristics are quite different.

I would suggest the obvious AMD_MFMA_AVAILABLE and AMD_WMMA_AVAILABLE, with the AMD_MFMA_AVAILABLE path using V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8 based on the presence of the CDNA3 define.
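
A hedged sketch of that dispatch, assuming the LLVM builtins that map to these instructions (`__builtin_amdgcn_mfma_i32_16x16x32_i8` on CDNA3, `__builtin_amdgcn_mfma_i32_16x16x16i8` on CDNA1/2) and a CDNA3 define in the style of llama.cpp's HIP headers; treat the wrapper name and the exact operand packing as assumptions:

```cpp
#include <cstdint>

typedef int int32x4_t __attribute__((ext_vector_type(4)));

// Accumulate a 16x16 int32 tile over k = 32 int8 values: one k=32 MFMA on
// CDNA3, or two chained k=16 MFMAs on older CDNA (same net result).
static __device__ int32x4_t mma_i32_16x16x32(const int * a, const int * b, int32x4_t acc) {
#if defined(CDNA3)
    // k = 32 in a single instruction; each lane feeds 8 int8 values per operand.
    acc = __builtin_amdgcn_mfma_i32_16x16x32_i8(
        ((const int64_t *) a)[0], ((const int64_t *) b)[0], acc, 0, 0, 0);
#else
    // CDNA1/2 only have k = 16, so issue the instruction twice.
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8(a[0], b[0], acc, 0, 0, 0);
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8(a[1], b[1], acc, 0, 0, 0);
#endif
    return acc;
}
```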

@JohannesGaessler (Collaborator)

> This is only a partial review because I'm noticing that some calculations on master are wrong in terms of the logic (but correct in terms of the values). My suggestion for this PR is that I fix the incorrect logic on master first and that we then rebase this PR.

It seems I misremembered how the code works; the part I thought was wrong is correct after all. I'll continue with the review.

@deepsek (Author) commented Jul 22, 2025

@JohannesGaessler, I've refactored the code per your suggestions in the comments above.
@ggerganov, No regressions on my end either. All tests passed!

@IMbackK,

  • Per your suggestion, renamed amd_mma to amd_mfma.
  • I've also added the MFMA instructions for CDNA2 (CDNA uses the same instructions) in the mma.cuh file. Even though the instructions are added, the code path is disabled in common.cuh via amd_mfma_available (see the sketch after this list). The same goes for stream-K in the mmq.cu file.
  • I've tested for functionality and it works, but it's disabled by default: for the Llama 3.1 8B model on CDNA2 with stream-K and MFMA enabled, performance is better only up to a prompt size of 256/512. Above that (512-4096 tested), performance drops with the use of the matrix cores. I haven't had the chance to test on CDNA, but I suspect it will behave the same as CDNA2.
  • Maybe in the future we can look into enabling the MFMA code path only for smaller prompt sizes on the legacy CDNAs, but it's not on my roadmap at the moment.
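
For reference, a minimal sketch of the gating described in the second bullet above, assuming a hypothetical GGML_CUDA_CC_IS_CDNA3 helper in the style of common.cuh (the PR's actual predicate may differ):

```cpp
// Hedged sketch: the MFMA kernels compile for all CDNA targets, but the
// availability predicate only enables them on CDNA3, where they are a clear
// win; CDNA1/2 keep the existing non-MFMA path by default.
static bool amd_mfma_available(const int cc) {
#if defined(GGML_USE_HIP)
    return GGML_CUDA_CC_IS_CDNA3(cc);  // assumed helper: true only for gfx94x
#else
    GGML_UNUSED(cc);
    return false;
#endif
}
```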

@IMbackK (Collaborator) commented Jul 22, 2025

> @JohannesGaessler, I've refactored the code per your suggestions in the comments above. @ggerganov, No regressions on my end either. All tests passed!
>
> @IMbackK,
>
> * Per your suggestion, renamed `amd_mma` to `amd_mfma`

Looks good, thank you.

> * I've also added the MFMA instructions for CDNA2 (CDNA uses the same instructions) in the `mma.cuh` file. Even though the instructions are added, the code path is disabled in `common.cuh` via `amd_mfma_available`. The same goes for `stream-K` in the `mmq.cu` file.
>
> * I've tested for functionality and it works, but it's disabled by default: for the Llama 3.1 8B model on CDNA2 with `stream-K` and `MFMA` enabled, performance is **better** only up to a prompt size of 256/512. Above that (512-4096 tested), performance drops with the use of the matrix cores.

That's good. I'm mainly concerned with being able to test this code path, for which this will do fine.

> * Maybe in the future we can look into enabling the MFMA code path only for smaller prompt sizes on the legacy CDNAs, but it's not on my roadmap at the moment.

I will take some time, hopefully soon, to tune for CDNA1; that's what I mainly do around here, after all.

I will add my approval once I have had the chance to test this path on both gfx908 and gfx94x and CI has completed successfully.

Labels: devops · ggml · Nvidia GPU
5 participants