[Performance] Use optimized kernels for MQA/GQA #1880
Comments
This is a feature that is high on my list for performance reasons. I have scavenged other people's benchmarks and found an interesting one from TensorRT-LLM that also uses PagedAttention: Llama 2 7B with 1x A100 80GB:
Latency benchmark:
Conclusion: Because MHA takes up more KV cache, latency increases since more memory bandwidth is needed to read it. This is especially visible in scenarios that aim for high throughput, i.e. when handling many requests at once. This means we can see a significant improvement in throughput if we spend less memory on the KV cache thanks to the head grouping in GQA. Reference: NVIDIA/TensorRT-LLM#404
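To make the KV-cache argument concrete, here is a rough back-of-the-envelope sketch (an editorial illustration, not from the thread). The config numbers assume a Llama-2-7B-like model (32 layers, 32 query heads, head_dim 128, fp16); the 8-KV-head GQA variant is hypothetical, purely for comparison:

```python
# Rough estimate of KV cache size per token for MHA vs. a hypothetical GQA config.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # 2x for storing both K and V per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

mha = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 2**20:.3f} MiB per token")  # ~0.500 MiB
print(f"GQA: {gqa / 2**20:.3f} MiB per token")  # ~0.125 MiB
# With a fixed HBM budget, a 4x smaller KV cache means ~4x more tokens can stay
# resident, and each decode step reads ~4x fewer KV bytes.
```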
Seems it is possible to migrate https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention.h to fulfill vLLM's requirement.
Sorry, my bad.
@whitelok Oh, not really. While I'm not sure at the moment, it seems there is some code that we can leverage. I will look into it soon. Thanks again for sharing!
It seems the entry point is here: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h#L1015. All vLLM needs to do is modify the KV cache buffer.
FYI, I have modified the FlashAttention kernel to support a paged KV cache, with the restriction that block_size must match kBlockN in the kernel.
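For intuition, a minimal sketch of how a paged KV cache is addressed when the cache block_size equals the kernel's kBlockN tile, so each KV tile the kernel loads maps to exactly one physical cache block. This is illustrative only; the BLOCK_SIZE value and locate_kv helper are made up, not code from the branch:

```python
BLOCK_SIZE = 128  # assumed here to equal kBlockN in the modified kernel

def locate_kv(block_table, token_idx, block_size=BLOCK_SIZE):
    """Map a logical token position in a sequence to (physical_block, offset)."""
    logical_block = token_idx // block_size
    offset = token_idx % block_size
    return block_table[logical_block], offset

# Example: a sequence whose logical blocks 0..2 live in physical blocks 7, 3, 9.
block_table = [7, 3, 9]
print(locate_kv(block_table, 200))  # -> (3, 72): physical block 3, slot 72
```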
It seems they made some changes to GQA/MQA in the new version 0.6.1. Diff: https://www.diffchecker.com/lixZr3Aj/
Great work! As it is very important for performance, do you plan to submit the feature? I would be glad to test it.
@zhaoyang-star See the following:
I've tidied up the code a bit; you could test using the following two branches:
Thanks for the info. I will share the latency comparison after benchmarking.
@beginlner I found you wrote a new function
I think it will be a little complicated.
These are great results. I hope that decoding can get a speedup as well, as this is likely to yield a substantial improvement. @zhaoyang-star make sure to fall back to xformers attention since it supports older GPUs as well; FlashAttention supports Ampere and newer. @beginlner Do you know what it takes for Tri Dao to accept
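A minimal sketch of the fallback being suggested, dispatching on compute capability (illustrative only; pick_attention_backend is a hypothetical helper, not vLLM's actual dispatch code):

```python
import torch

def pick_attention_backend(device: int = 0) -> str:
    # FlashAttention requires Ampere (SM 8.0) or newer; fall back to xformers otherwise.
    major, _minor = torch.cuda.get_device_capability(device)
    if major >= 8:      # Ampere, Ada, Hopper, ...
        return "flash-attn"
    return "xformers"   # older GPUs (Volta, Turing, ...)

if torch.cuda.is_available():
    print(pick_attention_backend())
```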
@beginlner It seems the blocked flash attn unittest failed.
@zhaoyang-star Thanks for the reminder, the error tolerance should be relaxed.
@beginlner The greatest relative difference is 9.48. Is this a little larger than expected? FYI, the original test_flash_attn.py also failed.
Based on my experience, the greatest absolute difference of 0.008 and the greatest relative difference of 9.48 are acceptable. I have pushed a more reliable test to the branch. Additionally, how did the original test_flash_attn.py fail?
@beginlner Sorry, I did not store the log. I suggest you run the test yourself.
It failed on some tests only due to OOM on an A100 40G, because someone else was also using the GPU.
Good news.
@beginlner The unittest has passed. From the kernel benchmark and the e2e benchmark, we can see there is no speedup compared with paged attention.
I would not expect a speedup for implementing GQA on prefilling, only on decoding (which is much harder because of PagedAttention). |
Hi, here is my kernel benchmark result on an SXM A100 40GB. I have updated the code at https://github.com/beginlner/vllm/tree/blocked_flash_attn. Note that the shape and the block size of the KV cache are different from vLLM's paged attention.
Yes, GQA is compute-bound on prefilling and memory-bound on decoding, so there is a speedup only on decoding.
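A rough arithmetic-intensity sketch of why that is (illustrative numbers and a hypothetical helper, not from the thread): prefill reuses each loaded KV byte across many query tokens, while decode has a single query token per step:

```python
# FLOPs of attention per byte of KV cache read, per head, for fp16.

def attn_flops_per_kv_byte(num_query_tokens, head_dim=128, dtype_bytes=2):
    # Per KV token per head: ~2*head_dim FLOPs for QK^T plus ~2*head_dim for PV,
    # for each query token attending to it.
    flops = num_query_tokens * 4 * head_dim
    # KV bytes per KV token per head: K and V, head_dim elements each.
    bytes_read = 2 * head_dim * dtype_bytes
    return flops / bytes_read

print(attn_flops_per_kv_byte(num_query_tokens=2048))  # prefill: ~2048 FLOPs/byte -> compute-bound
print(attn_flops_per_kv_byte(num_query_tokens=1))     # decode:  ~1 FLOP/byte     -> memory-bound
```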
@beginlner Yes. The shape and block_size are different from paged attention. I used https://github.com/zhaoyang-star/vllm/tree/blocked_flash_attn_based_on_beginIner, which adds some options on top of your branch blocked_flash_attn. Results using this env:
Starcoder (MQA) e2e latency with In/Out length=512:
@zhaoyang-star It's expected that the speedup can hardly be seen in the e2e latency benchmark when the batch size is small: when the batch size is small, loading the parameters is the bottleneck; when the batch size is large, loading the KV cache is the bottleneck.
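A back-of-the-envelope sketch of that crossover (assumed numbers, not measurements: a ~14 GB fp16 7B model and the ~0.5 MiB/token MHA KV cache estimated earlier):

```python
# Per decode step, weights are read once regardless of batch size,
# while KV-cache reads scale with batch_size * context_len.

WEIGHT_BYTES = 14e9              # ~7B params in fp16
KV_BYTES_PER_TOKEN = 0.5 * 2**20 # MHA estimate from the earlier sketch

def decode_step_bytes(batch_size, context_len):
    kv = batch_size * context_len * KV_BYTES_PER_TOKEN
    return WEIGHT_BYTES, kv

for bs in (1, 8, 64, 256):
    w, kv = decode_step_bytes(bs, context_len=1024)
    bound = "weights" if w > kv else "KV cache"
    print(f"batch={bs:4d}: weights={w/1e9:.1f} GB, kv={kv/1e9:.1f} GB -> {bound}-bound")
```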
@zhaoyang-star Blocked KV cache was added to flash attention in 2.5.0. I wonder if the newer implementation gives any performance boost? Either way, it’s now in flash attention, which makes it easy to use in vLLM.
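For reference, a minimal sketch of what calling the paged ("blocked") KV cache path in FlashAttention >= 2.5.0 might look like. The shapes and constraints below (e.g. the 256-token page block size and the int32 block_table) follow my reading of the flash_attn_with_kvcache documentation and should be verified against the installed version:

```python
import torch
from flash_attn import flash_attn_with_kvcache

batch, heads, kv_heads, head_dim = 2, 32, 8, 128
block_size, num_blocks = 256, 16  # assumed paged block size constraint in FA 2.5.0

# Single-token decode query and a paged KV cache laid out as
# (num_blocks, block_size, num_kv_heads, head_dim).
q = torch.randn(batch, 1, heads, head_dim, device="cuda", dtype=torch.float16)
k_cache = torch.randn(num_blocks, block_size, kv_heads, head_dim,
                      device="cuda", dtype=torch.float16)
v_cache = torch.randn_like(k_cache)

# One row of physical block indices per sequence.
block_table = torch.arange(num_blocks, device="cuda", dtype=torch.int32).reshape(batch, -1)
cache_seqlens = torch.tensor([300, 500], device="cuda", dtype=torch.int32)

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # (batch, 1, heads, head_dim)
```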
Great! @beginlner is the core contributor of this feature. I have not benchmarked it under FA 2.5.0, but I think the results will be close to the data we had before. The main question is that FA only supports … We are looking for any contributions to deliver this feature :)
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
In theory, MQA/GQA can reduce the memory bandwidth needed to read the KV cache and enable using Tensor Cores for the dot products in the attention mechanism. However, this benefit can only be realized when using optimized kernels, which vLLM does not have at the moment.
vllm/vllm/model_executor/layers/attention.py
Lines 121 to 128 in e5452dd
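For illustration, a hedged sketch of the gap described above (not vLLM's actual code): a naive GQA path replicates each KV head across its query group before calling a standard MHA kernel, so the kernel still touches num_query_heads worth of K/V, whereas an optimized MQA/GQA kernel would read each KV head only once:

```python
import torch

batch, seq, num_q_heads, num_kv_heads, head_dim = 1, 1024, 32, 8, 128
q = torch.randn(batch, seq, num_q_heads, head_dim)
k = torch.randn(batch, seq, num_kv_heads, head_dim)
v = torch.randn(batch, seq, num_kv_heads, head_dim)

# Naive path: expand KV heads so every query head gets its own copy.
group = num_q_heads // num_kv_heads
k_expanded = k.repeat_interleave(group, dim=2)  # (1, 1024, 32, 128)
v_expanded = v.repeat_interleave(group, dim=2)

print(k.numel() * k.element_size(), "bytes of unique K vs",
      k_expanded.numel() * k_expanded.element_size(), "bytes actually fed to MHA")
```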