
MHA's E2E latency is unusually higher than MQA/GQA's on LLaMA-7B. #404

Closed
@liye0626

Description


When using in-flight batching (gptManagerBenchmark), we found that MHA's E2E latency is unusually higher than that of MQA/GQA on LLaMA-7B.

Our setup:

  • max_batch_size: 128
  • number of requests (sequences): 128
  • sequence_length: 1024
      • input length: 32
      • output length: 992
  • model: LLaMA-2-7B (dummy weights)
  • hardware: 1× A100 (80 GB)

We further found the cause: under in-flight batching, the actual batch sizes during model inference are not the same for MQA/GQA/MHA. Why is this? Can we control the batch size of in-flight batching so that all requests are completed in a single batch?
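For reference, here is a back-of-the-envelope estimate of the per-sequence KV-cache footprint, which we suspect is what drives the different effective batch sizes. This is only a sketch: the FP16 KV cache, the 60 GiB pool size, and the KV-head counts for the MQA/GQA variants (1 and 8) are our assumptions, not values read from gptManagerBenchmark.

```python
# Back-of-the-envelope KV-cache sizing for LLaMA-2-7B under in-flight batching.
# Assumptions (not taken from benchmark logs): FP16 KV cache, 32 layers,
# head_dim = 128, seq_len = 1024 (input 32 + output 992), and hypothetical
# KV-head counts for the MQA/GQA variants.

BYTES_FP16 = 2
NUM_LAYERS = 32
HEAD_DIM = 128
SEQ_LEN = 1024
KV_POOL_GIB = 60  # assumed GPU memory left for the KV-cache pool on one A100-80G

def kv_bytes_per_seq(num_kv_heads: int) -> int:
    # 2x for K and V, per layer, per KV head, per token, per sequence.
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_FP16 * SEQ_LEN

for name, kv_heads in [("MHA", 32), ("GQA (assumed 8 groups)", 8), ("MQA", 1)]:
    per_seq = kv_bytes_per_seq(kv_heads)
    max_seqs = (KV_POOL_GIB * 1024**3) // per_seq
    print(f"{name:24s} {per_seq / 2**20:7.1f} MiB/seq -> "
          f"~{max_seqs} concurrent seqs in a {KV_POOL_GIB} GiB pool")
```

Under these assumptions, MHA needs about 512 MiB of KV cache per 1024-token sequence, so a 60 GiB pool fits only ~120 concurrent sequences and the scheduler cannot admit all 128 requests at once, while the MQA/GQA variants fit all 128 easily. This would explain why the observed batch sizes differ.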

Metadata

Labels

triaged: Issue has been triaged by maintainers
