
Import FlashInfer: 3x faster PagedAttention than vLLM #2767

Closed
casper-hansen opened this issue Feb 5, 2024 · 2 comments
Comments

@casper-hansen
Contributor

It looks like vLLM could directly import the PagedAttention kernels from FlashInfer to support GQA: "For batch GQA decoding attention, FlashInfer w/ Tensor Cores is 3x faster than vLLM PagedAttention when batch_size=64." @WoosukKwon

https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html

[Benchmark figure from the FlashInfer blog post]
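For context, below is a minimal sketch of what a batched GQA decode over a paged KV cache looks like with FlashInfer. The wrapper name, `plan`/`run` methods, and the `use_tensor_cores` flag are taken from the FlashInfer documentation, but the API has shifted between releases (older versions use `begin_forward`/`forward`), so treat the exact signatures and defaults as assumptions rather than the actual vLLM integration:

```python
# Illustrative sketch of FlashInfer's batch decode kernel on a paged KV cache.
# Method names and tensor layouts follow the FlashInfer docs, but may differ
# across FlashInfer versions; this is not the vLLM integration itself.
import torch
import flashinfer

batch_size, page_size, head_dim = 64, 16, 128
num_qo_heads, num_kv_heads = 32, 8        # GQA: 4 query heads per KV head
max_pages_per_seq = 8
total_pages = batch_size * max_pages_per_seq

# Paged KV cache, "NHD" layout: (num_pages, 2, page_size, num_kv_heads, head_dim)
kv_cache = torch.randn(
    total_pages, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# CSR-style page table: which pages each sequence owns, and how full the last page is.
kv_page_indptr = torch.arange(
    0, total_pages + 1, max_pages_per_seq, dtype=torch.int32, device="cuda"
)
kv_page_indices = torch.arange(total_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full(
    (batch_size,), page_size, dtype=torch.int32, device="cuda"
)

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
# use_tensor_cores=True is the mode the blog post benchmarks for GQA decoding.
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace, "NHD", use_tensor_cores=True
)
wrapper.plan(
    kv_page_indptr, kv_page_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)   # (batch_size, num_qo_heads, head_dim)
```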

@zhuohan123
Member

We are talking to the FlashInfer team and working on merging it with vLLM!

@sumo43

sumo43 commented Feb 5, 2024

I made a draft PR implementing FlashInfer (#2772). Would love to help merge it.

The linear bot closed this as not planned on Aug 6, 2024
@simon-mo reopened this on Aug 6, 2024
@simon-mo closed this as completed on Aug 6, 2024