
Use FlashAttention for multi_query_kv_attention #4

Merged

WoosukKwon merged 8 commits into main from flash-attn on Mar 2, 2023
Conversation

WoosukKwon (Collaborator) commented on Mar 2, 2023

This PR uses FlashAttention kernels for multi_query_kv_attention, which performs masked attention over the prompt inputs.

Pros

  • FlashAttention is fast and memory-efficient.
  • FlashAttention supports packed 1D inputs and invokes only a single kernel to handle multiple sequences with variable lengths (see the sketch below the Cons section).

Cons

  • FlashAttention does not support cached KV, which is required for interactive generation.
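
As a rough illustration of the variable-length packing mentioned in the Pros, the sketch below packs several prompts into one 1D batch and makes a single FlashAttention call. It assumes the flash_attn_unpadded_func entry point from FlashAttention v1.x; the exact import path and signature differ across FlashAttention versions, and multi_query_kv_attention_sketch is an illustrative helper rather than the code in this PR:

```python
# Minimal sketch, not the vLLM implementation: pack variable-length prompts into a
# single 1D batch and run one FlashAttention kernel over all of them.
import torch
from flash_attn.flash_attn_interface import flash_attn_unpadded_func  # FlashAttention v1.x

def multi_query_kv_attention_sketch(qs, ks, vs, softmax_scale):
    """qs/ks/vs: per-prompt fp16 CUDA tensors of shape [seq_len, num_heads, head_size]."""
    seq_lens = [q.shape[0] for q in qs]
    # cu_seqlens marks each prompt's start offset in the packed layout: [0, l0, l0+l1, ...].
    cu_seqlens = torch.tensor([0] + seq_lens, dtype=torch.int32, device=qs[0].device)
    cu_seqlens = torch.cumsum(cu_seqlens, dim=0, dtype=torch.int32)
    q, k, v = (torch.cat(t, dim=0) for t in (qs, ks, vs))  # [total_tokens, heads, head_size]
    # One kernel launch covers every prompt; causal=True applies the masked attention.
    return flash_attn_unpadded_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
        dropout_p=0.0, softmax_scale=softmax_scale, causal=True,
    )
```

Because there is no cached-KV argument in this path, it only covers the prompt phase; the generation phase still relies on the cached-KV attention kernel.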

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

WoosukKwon merged commit 3e9f991 into main on Mar 2, 2023
WoosukKwon deleted the flash-attn branch on Mar 2, 2023, 05:13
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…o-model-executor

Adapt OpenVINO CPU plugin implementation
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is a dict list now in order to pass "n"

* clean up code from unnecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Account for the mamba layer cache in the num-of-blocks calculation (see the sketch after this commit message)

* Delete mamba cache after profile

* Remove prints

* Cleaning

* - and not _ for requirements

Approved-by: Tomer Asida
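
As a hypothetical sketch of the num-of-blocks consideration mentioned above (all names and the arithmetic are illustrative, not the actual Jamba integration), memory reserved for the Mamba layers' state would be set aside before the remaining GPU memory is split into KV-cache blocks:

```python
def num_gpu_blocks(free_gpu_bytes: int, kv_block_bytes: int, mamba_cache_bytes: int) -> int:
    # Illustrative only: reserve the Mamba state cache first, then carve the
    # remainder into fixed-size KV-cache blocks.
    usable = max(free_gpu_bytes - mamba_cache_bytes, 0)
    return usable // kv_block_bytes
```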
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
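
For readers unfamiliar with the format, 2:4 semi-structured sparsity means that in every contiguous group of four weights at most two are nonzero. The snippet below is a plain-PyTorch illustration of that pattern; prune_2_4 is a made-up helper for illustration, not the magic_wand/SparseTensor API used in the commit:

```python
# Hypothetical illustration of the 2:4 semi-structured pattern, not the magic_wand API.
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude weights in every group of four along the input dim."""
    out_features, in_features = weight.shape  # in_features must be divisible by 4
    groups = weight.reshape(out_features, in_features // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices  # indices of the 2 largest per group of 4
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(out_features, in_features)
```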
alixiaodi mentioned this pull request on Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024