MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B. #404
Comments
Sorry, I don't understand your point that MHA's E2E latency is unusually larger than MQA/GQA's. Can you explain more? The model should use either MHA or MQA/GQA, not both. Do you mean that you run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, and latency A is larger than latency B?
Sorry, what I mean is exactly that: I run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, but A is abnormally larger than B: MHA latency (133.40 s) > GQA latency (44.74 s) > MQA latency (29.17 s). We believe MHA latency should not be more than twice that of GQA, because GQA only reduces the weights of self.qkv and the softmax(QK^T)V part; the MLP part is unchanged, and the latency of the MLP is also very high.
That's expected, because MMHA is bound by memory reads rather than compute. Using MQA/GQA reduces the amount of data loaded from global memory, so the latency improves significantly.
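For intuition, here is a back-of-envelope sketch (not code from this repository) of how much KV-cache data the generation-phase attention kernel has to stream from global memory per generated token. The shapes assume LLaMA-7B (32 layers, head_dim 128) with an FP16 KV cache, and the 8-KV-head GQA configuration is an illustrative assumption:

```python
# Back-of-envelope sketch: bytes of KV cache read from global memory per
# generated token, per request. Assumed shapes: LLaMA-7B-like (32 layers,
# head_dim 128), FP16 KV cache; GQA with 8 KV heads is an assumption.

def kv_bytes_per_decode_step(seq_len, n_kv_heads, n_layers=32, head_dim=128, dtype_bytes=2):
    # Each decode step re-reads K and V for every cached token in every layer.
    return 2 * n_kv_heads * head_dim * n_layers * dtype_bytes * seq_len

seq_len = 32 + 992  # prompt + generated tokens near the end of the run
for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    gb = kv_bytes_per_decode_step(seq_len, n_kv_heads) / 1e9
    print(f"{name}: ~{gb:.2f} GB of KV cache read per decode step, per request")
```

Because this traffic scales linearly with the number of K/V heads, MHA reads roughly 4x more KV-cache data per step than 8-head GQA and 32x more than MQA, which is why a memory-bound kernel shows such large latency differences.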
@byshiue We found that the GEMM latency of MHA is more than three times that of GQA. However, compared with MHA, GQA only reduces the GEMM of self.qkv(x). The amount of computation in the MLP and in the GEMM of self.dense(x) is higher than in self.qkv(x). So even if the GEMM in the self.qkv(x) part is reduced, it should not lead to MHA's GEMM latency exceeding three times GQA's. (measurements: MHA | GQA | MQA)
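For reference, a rough per-layer FLOP count with assumed LLaMA-7B shapes (a sketch, not profiler output) shows that the self.qkv projection is only a minority of the per-layer GEMM work, which is why a 3x gap is hard to explain from GEMM FLOPs alone; the 8-KV-head GQA configuration is an assumption:

```python
# Rough per-token GEMM FLOP count for one transformer layer, using assumed
# LLaMA-7B shapes (hidden 4096, MLP intermediate 11008, head_dim 128).
# Illustrative sketch only; GQA with 8 KV heads is an assumption.

HIDDEN, INTER, HEAD_DIM = 4096, 11008, 128

def layer_gemm_flops(n_kv_heads):
    kv_dim = n_kv_heads * HEAD_DIM
    qkv = 2 * HIDDEN * (HIDDEN + 2 * kv_dim)  # self.qkv projection
    dense = 2 * HIDDEN * HIDDEN               # self.dense (attention output)
    mlp = 3 * 2 * HIDDEN * INTER              # gate, up and down projections
    return qkv, dense, mlp

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    qkv, dense, mlp = layer_gemm_flops(n_kv_heads)
    share = qkv / (qkv + dense + mlp)
    print(f"{name}: qkv {qkv/1e6:.0f} MFLOPs, dense+MLP {(dense + mlp)/1e6:.0f} MFLOPs, "
          f"qkv share {share:.0%}")
```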
What do you mean by the GEMM of MHA? Do you mean the GEMM latency in the MHA case?
@byshiue
In in-flight batching mode, requests are batched automatically, so it is hard to keep the batch size fixed. I suggest running under V1 mode if you only want to compare MHA/MQA/GQA latency.
@byshiue Thank you for your suggestion.
@liye0626 How many K/V heads do you use for GQA/MQA? During generation, the number of heads has a large impact on the size of the KV cache. If your case is dominated by the time taken by MHA/GQA/MQA, that extra bandwidth consumption will have an impact. Also, MHA leading to a bigger KV cache reduces your chances to batch requests with in-flight batching, so you may be running with a smaller batch size and lower efficiency. That can impact both context and generation.
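To illustrate the batching point, here is a rough sketch (assumed LLaMA-7B shapes and an arbitrary memory budget, not numbers taken from the benchmark) of how the K/V-head count limits how many requests in-flight batching can keep resident at once:

```python
# Sketch: how the number of K/V heads affects how many requests fit in a
# fixed KV-cache memory pool with in-flight batching. Assumed: LLaMA-7B-like
# shapes (32 layers, head_dim 128), FP16 cache, max sequence length 32 + 992;
# the 16 GB pool and the 8-KV-head GQA setup are illustrative only.

N_LAYERS, HEAD_DIM, DTYPE_BYTES = 32, 128, 2
MAX_SEQ_LEN = 32 + 992
KV_POOL_BYTES = 16e9  # hypothetical memory left over for the KV cache

def kv_bytes_per_request(n_kv_heads):
    return 2 * n_kv_heads * HEAD_DIM * N_LAYERS * DTYPE_BYTES * MAX_SEQ_LEN

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    per_req = kv_bytes_per_request(n_kv_heads)
    print(f"{name}: {per_req / 1e9:.2f} GB of KV cache per request -> "
          f"~{int(KV_POOL_BYTES // per_req)} requests fit in the pool")
```

With fewer requests resident at a time, the scheduler runs at a smaller effective batch size, which matches the different batch sizes observed across MHA/GQA/MQA.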
I’m closing this issue as there’s no real action for us. Feel free to reopen if you think it’s needed. Thanks!
When using in-flight batching (gptManagerBenchmark), we found that MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B.
Our case:
- input length: 32
- output length: 992
We further found that the reason is that, with in-flight batching, the actual batch sizes during model inference are not the same for MQA/GQA/MHA. Why is this? Can we control the batch size of in-flight batching so that all requests are completed in one batch?