Description
When using in-flight batching (gptManagerBenchmark), we found that MHA's end-to-end (E2E) latency is noticeably higher than MQA/GQA on LLaMA-7B.
Our setup:
- max_batch_size: 128
- number of requests (sequences): 128
- sequence_length: 1024
  - input length: 32
  - output length: 992
- model: LLaMA-2-7B (dummy weights)
- hardware: 1x A100 (80 GB)
We further found that the cause is that, under in-flight batching, the actual batch sizes used during model inference are not the same for MQA/GQA/MHA. Why is this? Can we control the batch size of in-flight batching so that all requests are completed in one batch?
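
For reference, a rough back-of-envelope estimate of the per-request KV-cache footprint, which we suspect is related to the differing effective batch sizes. This is only an illustrative sketch under assumed values (FP16 KV cache, 32 layers, head_dim=128, a GQA group count of 8, and an assumed ~40 GiB of free memory left for KV cache on one 80 GB A100), not measured numbers:

```python
# Rough per-request KV-cache footprint for LLaMA-2-7B at seq_len=1024.
# Assumptions: FP16 KV cache (2 bytes/element), 32 layers, head_dim=128;
# the GQA group count (8) is just an example. Only num_kv_heads differs
# between MHA / GQA / MQA.
def kv_cache_bytes(num_kv_heads, seq_len=1024, num_layers=32,
                   head_dim=128, dtype_bytes=2):
    # factor of 2 for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

kv_budget_gib = 40  # assumed memory left for KV cache on one 80 GB A100
for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    per_req_gib = kv_cache_bytes(kv_heads) / 1024**3
    print(f"{name:5s}: {per_req_gib:.3f} GiB/request -> "
          f"~{int(kv_budget_gib / per_req_gib)} concurrent requests")
```

Under these assumptions, MHA (~0.5 GiB per request) would fit fewer than 128 concurrent requests in the KV cache, while GQA/MQA would fit far more, which could explain why the scheduler runs MHA with a smaller actual batch size.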