MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B. #404
Comments
Sorry, I don't understand your point that MHA's E2E latency is unusually larger than MQA/GQA's. Can you explain more? The model should use either MHA or MQA/GQA, not both. Do you mean that you run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, and latency A is larger than latency B?
Sorry, what I mean is exactly that: I run LLaMA-7B with MHA and get latency A, then run LLaMA-7B with MQA and get latency B, but A is abnormally larger than B: MHA latency (133.40 s) > GQA latency (44.74 s) > MQA latency (29.17 s). We believe MHA latency should not be more than twice that of GQA, because GQA only reduces the weights of self.qkv and the softmax(QK^T)V part; the MLP part is unchanged, and the latency of the MLP is also very high.
That's expected, because MMHA is bound by memory reads rather than compute. Using MQA/GQA reduces the amount of data loaded from global memory, so the latency improves significantly.
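For intuition, here is a back-of-envelope sketch (not code from this repository) of how much KV-cache data the generation-phase attention kernel has to stream from global memory per generated token. The shapes assume LLaMA-7B (32 layers, head_dim 128) with an FP16 KV cache, and the 8-KV-head GQA configuration is an illustrative assumption:

```python
# Back-of-envelope sketch: bytes of KV cache read from global memory per
# generated token, per request. Assumed shapes: LLaMA-7B-like (32 layers,
# head_dim 128), FP16 KV cache; GQA with 8 KV heads is an assumption.

def kv_bytes_per_decode_step(seq_len, n_kv_heads, n_layers=32, head_dim=128, dtype_bytes=2):
    # Each decode step re-reads K and V for every cached token in every layer.
    return 2 * n_kv_heads * head_dim * n_layers * dtype_bytes * seq_len

seq_len = 32 + 992  # prompt + generated tokens near the end of the run
for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    gb = kv_bytes_per_decode_step(seq_len, n_kv_heads) / 1e9
    print(f"{name}: ~{gb:.2f} GB of KV cache read per decode step, per request")
```

Because this traffic scales linearly with the number of K/V heads, MHA reads roughly 4x more KV-cache data per step than 8-head GQA and 32x more than MQA, which is why a memory-bound kernel shows such large latency differences.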
@byshiue We found that the GEMM latency of MHA is more than three times that of GQA. However, compared with MHA, GQA only reduces the GEMM of self.qkv(x). The amount of computation in the MLP and in the GEMM of self.dense(x) is higher than in self.qkv(x). So even if the GEMM in the self.qkv(x) part is reduced, it should not lead to MHA's GEMM latency exceeding three times GQA's. (measurements: MHA | GQA | MQA)
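For reference, a rough per-layer FLOP count with assumed LLaMA-7B shapes (a sketch, not profiler output) shows that the self.qkv projection is only a minority of the per-layer GEMM work, which is why a 3x gap is hard to explain from GEMM FLOPs alone; the 8-KV-head GQA configuration is an assumption:

```python
# Rough per-token GEMM FLOP count for one transformer layer, using assumed
# LLaMA-7B shapes (hidden 4096, MLP intermediate 11008, head_dim 128).
# Illustrative sketch only; GQA with 8 KV heads is an assumption.

HIDDEN, INTER, HEAD_DIM = 4096, 11008, 128

def layer_gemm_flops(n_kv_heads):
    kv_dim = n_kv_heads * HEAD_DIM
    qkv = 2 * HIDDEN * (HIDDEN + 2 * kv_dim)  # self.qkv projection
    dense = 2 * HIDDEN * HIDDEN               # self.dense (attention output)
    mlp = 3 * 2 * HIDDEN * INTER              # gate, up and down projections
    return qkv, dense, mlp

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    qkv, dense, mlp = layer_gemm_flops(n_kv_heads)
    share = qkv / (qkv + dense + mlp)
    print(f"{name}: qkv {qkv/1e6:.0f} MFLOPs, dense+MLP {(dense + mlp)/1e6:.0f} MFLOPs, "
          f"qkv share {share:.0%}")
```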
What do you mean by the GEMM of MHA? Do you mean the GEMM latency in the MHA case?
@byshiue
In in-flight batching mode, requests are batched automatically, so it is hard to keep the batch size fixed. I suggest running under V1 mode if you only want to compare MHA/MQA/GQA latency.
@byshiue Thank you for your suggestion.
@liye0626 How many K/V heads do you use for GQA/MQA? During generation, the number of heads has a large impact on the size of the KV cache. If your case is dominated by the time taken by MHA/GQA/MQA, that extra bandwidth consumption will have an impact. Also, MHA leading to a bigger KV cache reduces your chances to batch requests with in-flight batching, so you may be running with a smaller batch size and lower efficiency. That can impact both context and generation.
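To illustrate the batching point, here is a rough sketch (assumed LLaMA-7B shapes and an arbitrary memory budget, not numbers taken from the benchmark) of how the K/V-head count limits how many requests in-flight batching can keep resident at once:

```python
# Sketch: how the number of K/V heads affects how many requests fit in a
# fixed KV-cache memory pool with in-flight batching. Assumed: LLaMA-7B-like
# shapes (32 layers, head_dim 128), FP16 cache, max sequence length 32 + 992;
# the 16 GB pool and the 8-KV-head GQA setup are illustrative only.

N_LAYERS, HEAD_DIM, DTYPE_BYTES = 32, 128, 2
MAX_SEQ_LEN = 32 + 992
KV_POOL_BYTES = 16e9  # hypothetical memory left over for the KV cache

def kv_bytes_per_request(n_kv_heads):
    return 2 * n_kv_heads * HEAD_DIM * N_LAYERS * DTYPE_BYTES * MAX_SEQ_LEN

for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    per_req = kv_bytes_per_request(n_kv_heads)
    print(f"{name}: {per_req / 1e9:.2f} GB of KV cache per request -> "
          f"~{int(KV_POOL_BYTES // per_req)} requests fit in the pool")
```

With fewer requests resident at a time, the scheduler runs at a smaller effective batch size, which matches the different batch sizes observed across MHA/GQA/MQA.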
I’m closing this issue as there’s no real action for us. Feel free to reopen if you think it’s needed. Thanks!
When using in-flight batching (gptManagerBenchmark), we found that MHA's E2E latency is unusually larger than MQA/GQA on LLaMA-7B.
Our case:
- input length: 32
- output length: 992
We further found that the reason is that, with in-flight batching, the actual batch sizes during model inference are not the same for MQA/GQA/MHA. Why is this? Can we control the batch size of in-flight batching so that all requests are completed in one batch?