
Support computation pipelining after SWP refactoring #5185

Draft
wants to merge 2 commits into base: main
Conversation

manman-ren (Collaborator)

With the recent SWP refactoring, it is much easier to support arbitrary stage assignments where computations can be separated into different stages. Computation pipelining is essentially splitting the computations themselves across stages. Take flash attention as an example: currently the two loads are in stage 0 (S0), and all other ops are in the last stage (stage 2). The loop body looks like:
MMA0(i)
Softmax(i)
MUL(i)
MMA1(i)
LoadV(i+2)
LoadK(i+2)
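One way to see why the loads run two iterations ahead: in a software-pipelined loop with `num_stages` stages, an op assigned to stage `s` executes work for iteration `i + (num_stages - 1 - s)` in the steady-state body. A minimal sketch of that bookkeeping (illustrative only, not the actual pass):

```python
def steady_state(ops, num_stages):
    """Return (op, iteration offset) pairs for the steady-state body of a
    software-pipelined loop: an op in stage s works on iteration
    i + (num_stages - 1 - s)."""
    return [(name, num_stages - 1 - stage) for name, stage in ops]

# Baseline 3-stage schedule: loads in stage 0, compute in the last stage (2).
baseline = [("MMA0", 2), ("Softmax", 2), ("MUL", 2), ("MMA1", 2),
            ("LoadV", 0), ("LoadK", 0)]
print(steady_state(baseline, num_stages=3))
# [('MMA0', 0), ('Softmax', 0), ('MUL', 0), ('MMA1', 0), ('LoadV', 2), ('LoadK', 2)]
```

The compute ops land on iteration `i` (offset 0) while both loads fetch for `i+2`, which is exactly the body shown above.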

This patch defines two different pipeline schedules for attention-like kernels:
1> putting the first dot in S2, the other computations in S3, loadK in stage 0, and loadV in stage 1:
MMA0(i+1)
Softmax(i)
MUL(i)
MMA1(i)
loadK(i+3)
loadV(i+2)
2> putting the second dot in S3, the other computations in S2, loadK in stage 0, and loadV in stage 1:
MMA0(i+1)
MMA1(i)
Softmax(i+1)
MUL(i+1)
loadK(i+3)
loadV(i+2)
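The iteration offsets in both bodies follow from the same rule: with `num_stages = 4`, an op in stage `s` processes iteration `i + (num_stages - 1 - s)`. A quick sanity check (op names taken from the description above):

```python
def iter_offset(stage, num_stages=4):
    # Op assigned to stage s processes iteration i + (num_stages - 1 - s).
    return num_stages - 1 - stage

# Schedule 1: first dot in S2, other computations in S3, loadK S0, loadV S1.
sched1 = {"loadK": 0, "loadV": 1, "MMA0": 2, "Softmax": 3, "MUL": 3, "MMA1": 3}
print({op: f"i+{iter_offset(s)}" for op, s in sched1.items()})
# {'loadK': 'i+3', 'loadV': 'i+2', 'MMA0': 'i+1', 'Softmax': 'i+0', 'MUL': 'i+0', 'MMA1': 'i+0'}

# Schedule 2: second dot in S3, other computations in S2.
sched2 = {"loadK": 0, "loadV": 1, "MMA0": 2, "Softmax": 2, "MUL": 2, "MMA1": 3}
print({op: f"i+{iter_offset(s)}" for op, s in sched2.items()})
# MMA0/Softmax/MUL land on i+1 and MMA1 on i, matching the second body.
```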

Preliminary performance numbers on H100 for flash attention:

| (Batch, Heads, SeqLen, Dhead) | triton_tutorial_flash_v2_opt-tflops | triton_tutorial_flash_v2_tma-tflops | triton_tutorial_flash_v2-tflops |
|---|---|---|---|
| (8, 16, 8192, 128) | 517.528 | 504.565 | 481.402 |

The implementation and the frontend are preliminary and intended for discussion.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
@manman-ren manman-ren marked this pull request as draft November 18, 2024 21:50
manman-ren (Collaborator, Author)

@pawelszczerbuk The frontend is an annotation on the loop; inside the LoopSchedule pass, we use the annotation to check whether the ttgir matches the specific schedule and, if it does, perform the corresponding <stage, cluster> assignment.
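Concretely, one can picture the pass as holding a per-schedule table that maps matched ops to `<stage, cluster>` pairs and applying it once the ttgir matches. The sketch below is purely illustrative for the discussion — the schedule names, cluster values, and lookup API are invented here, not the PR's actual code:

```python
# Hypothetical <stage, cluster> tables for the two attention schedules.
# Stages follow the PR description; cluster values are made up (all 0).
SCHEDULES = {
    "attention_first_dot": {   # schedule 1: first dot in S2, rest in S3
        "loadK": (0, 0), "loadV": (1, 0),
        "MMA0": (2, 0), "Softmax": (3, 0), "MUL": (3, 0), "MMA1": (3, 0),
    },
    "attention_second_dot": {  # schedule 2: second dot in S3, rest in S2
        "loadK": (0, 0), "loadV": (1, 0),
        "MMA0": (2, 0), "Softmax": (2, 0), "MUL": (2, 0), "MMA1": (3, 0),
    },
}

def assign(schedule, op):
    """Look up the <stage, cluster> pair for a matched op, or None when the
    loop body does not fit the annotated schedule."""
    return SCHEDULES.get(schedule, {}).get(op)

print(assign("attention_second_dot", "MMA1"))  # (3, 0)
```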

I understand that you are working on further refactoring and possibly a frontend design for specifying a loop schedule. This PR is mostly meant to share the performance numbers and the preliminary implementation. Happy to work together on enabling this!
