
MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B. #404

Closed
liye0626 opened this issue Nov 16, 2023 · 10 comments


liye0626 commented Nov 16, 2023

When using in-flight batching (gptManagerBenchmark), we found that MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B.

Our case:

  • max_batch_size: 128
  • number of requests (sequences): 128
  • sequence_length: 1024
      • input length: 32
      • output length: 992
  • model: LLaMA-2-7B (dummy weights)
  • hardware: 1x A100 (80 GB)

We further found that the reason is that, with in-flight batching, the actual batch sizes during model inference are not the same for MQA/GQA/MHA. Why is this? Can we control the batch size of in-flight batching so that all requests are completed in one batch?

byshiue (Collaborator) commented Nov 16, 2023

Sorry, I don't get your point that MHA's E2E latency is unusually larger than MQA/GQA. Can you explain more? The model should only use either MHA or MQA/GQA.

Do you mean that you run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, and latency A is larger than latency B?

byshiue self-assigned this Nov 16, 2023
byshiue added the triaged label Nov 16, 2023
liye0626 (Author) commented:

Sorry.

What I mean is indeed that: we run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, and A is abnormally larger than B: MHA latency (133.40 s) > GQA latency (44.74 s) > MQA latency (29.17 s).

We believe that MHA latency should not be higher than twice that of GQA, because GQA only reduces the weights of self.qkv and the softmax(QKᵀ)V part; the MLP part is unchanged, and the latency of the MLP is also very high.
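As a rough check on this reasoning, here is a minimal sketch of the per-layer weight GEMMs, assuming LLaMA-7B-like dimensions; the GQA (8 KV heads) and MQA (1 KV head) configurations are illustrative assumptions, not values from this issue:

```python
# Per-layer weight counts, assuming LLaMA-7B-like dimensions:
# hidden 4096, 32 query heads, head_dim 128, MLP intermediate 11008.
# The 8 KV heads (GQA) and 1 KV head (MQA) below are illustrative assumptions.
hidden, n_q_heads, head_dim, inter = 4096, 32, 128, 11008

def attn_weight_params(n_kv_heads):
    w_q = hidden * n_q_heads * head_dim        # query projection (part of self.qkv)
    w_kv = 2 * hidden * n_kv_heads * head_dim  # key + value projections (part of self.qkv)
    w_o = n_q_heads * head_dim * hidden        # output projection (self.dense)
    return w_q + w_kv + w_o

mlp_params = 3 * hidden * inter                # gate, up and down projections

for name, n_kv in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    attn = attn_weight_params(n_kv)
    print(f"{name}: attn {attn/1e6:.1f}M + mlp {mlp_params/1e6:.1f}M "
          f"= {(attn + mlp_params)/1e6:.1f}M weights per layer")
```

Under these assumptions the total weight GEMM work per layer differs by well under 2x between MHA and GQA, since the MLP and output projection dominate.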

byshiue (Collaborator) commented Nov 16, 2023

That's expected, because MMHA is bound by memory reads rather than compute. Using MQA/GQA reduces the KV-cache loading from global memory, and hence the latency improves greatly.
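A back-of-the-envelope sketch of that memory-bound argument, counting KV-cache bytes read per generation step; the FP16 cache, context length, batch size, and GQA/MQA head counts below are assumptions for illustration:

```python
# KV-cache bytes read per generation step, per layer. Assumes an FP16 cache
# (2 bytes/element), head_dim 128, 1024-token context and 128 concurrent
# sequences; the GQA/MQA KV-head counts are illustrative assumptions.
bytes_per_elem, head_dim, context_len, batch = 2, 128, 1024, 128

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    # each step reads the whole K and V cache of every active sequence
    kv_bytes = 2 * batch * n_kv_heads * context_len * head_dim * bytes_per_elem
    print(f"{name}: ~{kv_bytes / 2**30:.2f} GiB of KV cache read per layer per step")
```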

liye0626 (Author) commented Nov 16, 2023

@byshiue
Thanks for your answer, but I'm still confused.
Although MMHA is memory-bound, I don't think MHA latency should be more than twice the MQA/GQA latency (MQA latency is indeed smaller than MHA latency).

We found that the GEMM latency of MHA is more than three times that of GQA. However, compared with MHA, GQA only reduces the GEMM of self.qkv(x). The compute in the MLP and in the GEMM of self.dense(x) is larger than that of self.qkv(x), so even with the smaller self.qkv(x) GEMM, MHA's GEMM latency should not exceed three times that of GQA:

| latency (s) | MHA | GQA | MQA |
| --- | --- | --- | --- |
| GEMM | 74.23 | 20.07 | 11.81 |
| MMHA | 36.99 | 16.57 | 11.95 |

byshiue (Collaborator) commented Nov 16, 2023

What is the meaning of "GEMM of MHA"? Do you mean the GEMM latency in the MHA case?
You could also check the GEMM shapes.
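To make the GEMM-shape suggestion concrete, here is a hypothetical sketch of the generation-phase GEMM shapes for one decoder layer, assuming LLaMA-7B dimensions; the effective batch sizes and KV-head counts are assumptions, not measured values:

```python
# Generation-phase GEMM shapes (M, N, K) for one decoder layer, assuming
# LLaMA-7B dimensions (hidden 4096, 32 query heads, head_dim 128, MLP
# intermediate 11008) and fused qkv / gate-up projections. With in-flight
# batching, M is the number of sequences active at that step, so a smaller
# effective batch means more steps (more GEMM launches) and lower per-GEMM
# efficiency for the same 128 requests.
hidden, inter, n_q_heads, head_dim = 4096, 11008, 32, 128

def layer_gemm_shapes(m, n_kv_heads):
    return {
        "self.qkv":    (m, (n_q_heads + 2 * n_kv_heads) * head_dim, hidden),
        "self.dense":  (m, hidden, n_q_heads * head_dim),
        "mlp.gate_up": (m, 2 * inter, hidden),
        "mlp.down":    (m, hidden, inter),
    }

# e.g. MHA at a smaller effective batch vs. GQA-8 with all requests batched
# (both batch sizes are illustrative assumptions, not values from this issue)
print(layer_gemm_shapes(m=32, n_kv_heads=32))
print(layer_gemm_shapes(m=128, n_kv_heads=8))
```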

liye0626 (Author) commented:

@byshiue
Sorry for my poor expression.
I mean the case of MHA (i.e., LLaMA w/ MHA).
We found that the actual batch sizes during model inference are not the same in the three cases MHA/MQA/GQA. Is there any way to control the batch size in model inference when we use in-flight batching via gptManagerBenchmark?

byshiue (Collaborator) commented Nov 17, 2023

In in-flight batching mode, requests are batched automatically and it is hard to fix the batch size. I suggest running under V1 mode if you only want to compare MHA/MQA/GQA latency.

liye0626 (Author) commented:

@byshiue Thank you for your suggestion.

jdemouth-nvidia (Collaborator) commented:

@liye0626 how many K/V heads do you use for GQA/MQA? During generation, the number of heads has a lot of impact on the size of the KV cache. If your case is dominated by the time taken by MHA/GQA/MQA, that extra bandwidth consumption will have an impact.

Also, since MHA leads to a bigger KV cache, it reduces your chances to batch requests with in-flight batching. So you may be running with a smaller batch size and lower efficiency. That can impact both the context and generation phases.
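To put rough numbers on that, here is a sketch of the per-sequence KV-cache footprint for LLaMA-7B at sequence length 1024, assuming an FP16 cache and an illustrative 40 GiB KV-cache budget on the 80 GB A100; the GQA/MQA head counts are also assumptions:

```python
# Per-sequence KV-cache size for LLaMA-7B (32 layers, head_dim 128) at
# sequence length 1024 with an FP16 cache, and how many sequences fit in a
# hypothetical 40 GiB KV-cache budget. The GQA/MQA KV-head counts and the
# budget are illustrative assumptions.
bytes_per_elem, layers, head_dim, seq_len = 2, 32, 128, 1024
kv_budget_bytes = 40 * 2**30

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    per_seq = 2 * layers * n_kv_heads * seq_len * head_dim * bytes_per_elem  # K and V
    fits = kv_budget_bytes // per_seq
    print(f"{name}: {per_seq / 2**20:.0f} MiB per sequence -> ~{fits} sequences fit")
```

Under these assumptions only around 80 sequences fit with MHA, which would be consistent with not all 128 requests being batched together in that case.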

jdemouth-nvidia (Collaborator) commented:

I’m closing that issue as there’s no real action for us. Feel free to reopen if you think it’s needed. Thanks!
