[Bugfix]: Fix DualChunkFlashAttention for short sequences #19084
Conversation
Signed-off-by: sa-buc <shanhaikang.shk@oceanbase.com>
With this fix I get another error when I use FP8 quantization (--quantization fp8): RuntimeError('query and key must have the same dtype')
ERROR 06-04 14:07:19 [engine.py:164] Traceback (most recent call last):
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 162, in start
ERROR 06-04 14:07:19 [engine.py:164] self.run_engine_loop()
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 225, in run_engine_loop
ERROR 06-04 14:07:19 [engine.py:164] request_outputs = self.engine_step()
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 251, in engine_step
ERROR 06-04 14:07:19 [engine.py:164] raise e
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 234, in engine_step
ERROR 06-04 14:07:19 [engine.py:164] return self.engine.step()
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1393, in step
ERROR 06-04 14:07:19 [engine.py:164] outputs = self.model_executor.execute_model(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 140, in execute_model
ERROR 06-04 14:07:19 [engine.py:164] output = self.collective_rpc("execute_model",
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-04 14:07:19 [engine.py:164] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/utils.py", line 2605, in run_method
ERROR 06-04 14:07:19 [engine.py:164] return func(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 06-04 14:07:19 [engine.py:164] output = self.model_runner.execute_model(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-04 14:07:19 [engine.py:164] return func(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1843, in execute_model
ERROR 06-04 14:07:19 [engine.py:164] hidden_or_intermediate_states = model_executable(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-04 14:07:19 [engine.py:164] return self._call_impl(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-04 14:07:19 [engine.py:164] return forward_call(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 481, in forward
ERROR 06-04 14:07:19 [engine.py:164] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 06-04 14:07:19 [engine.py:164] return self.forward(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 358, in forward
ERROR 06-04 14:07:19 [engine.py:164] hidden_states, residual = layer(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-04 14:07:19 [engine.py:164] return self._call_impl(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-04 14:07:19 [engine.py:164] return forward_call(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 257, in forward
ERROR 06-04 14:07:19 [engine.py:164] hidden_states = self.self_attn(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-04 14:07:19 [engine.py:164] return self._call_impl(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-04 14:07:19 [engine.py:164] return forward_call(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 187, in forward
ERROR 06-04 14:07:19 [engine.py:164] attn_output = self.attn(q, k, v)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-04 14:07:19 [engine.py:164] return self._call_impl(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-04 14:07:19 [engine.py:164] return forward_call(*args, **kwargs)
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/layer.py", line 237, in forward
ERROR 06-04 14:07:19 [engine.py:164] return torch.ops.vllm.unified_attention(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
ERROR 06-04 14:07:19 [engine.py:164] return self._op(*args, **(kwargs or {}))
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/layer.py", line 386, in unified_attention
ERROR 06-04 14:07:19 [engine.py:164] output = self.impl.forward(self, query, key, value, kv_cache,
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 493, in forward
ERROR 06-04 14:07:19 [engine.py:164] self._dual_chunk_flash_attn_prefill(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 673, in _dual_chunk_flash_attn_prefill
ERROR 06-04 14:07:19 [engine.py:164] current_out = self._dual_chunk_flash_attn_prefill_func(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 1055, in _dual_chunk_flash_attn_prefill_func
ERROR 06-04 14:07:19 [engine.py:164] flash_result = self._do_flash_attn(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 1207, in _do_flash_attn
ERROR 06-04 14:07:19 [engine.py:164] output, softmax_lse = flash_attn_varlen_func(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 227, in flash_attn_varlen_func
ERROR 06-04 14:07:19 [engine.py:164] out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
ERROR 06-04 14:07:19 [engine.py:164] File "/home/pierre/idextend/venv/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
ERROR 06-04 14:07:19 [engine.py:164] return self._op(*args, **(kwargs or {}))
ERROR 06-04 14:07:19 [engine.py:164] RuntimeError: query and key must have the same dtype
This issue is likely caused by the fact that DCA is implemented on top of FlashAttention, whose v2 kernels currently lack support for FP8 quantization.
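Conceptually, the error above comes from a kernel-side precondition that query, key, and value share one dtype. The sketch below is a hypothetical, pure-Python illustration of that guard (the function name and signature are invented; this is not vLLM's or FlashAttention's actual code):

```python
def check_qkv_dtypes(q_dtype: str, k_dtype: str, v_dtype: str) -> None:
    """Sketch of the kernel precondition that produced the error above.

    FlashAttention v2 expects query/key/value to share one half-precision
    dtype; with --quantization fp8 the query can arrive in an FP8 dtype
    while the KV cache still holds fp16, tripping this check.
    """
    if q_dtype != k_dtype or k_dtype != v_dtype:
        raise RuntimeError("query and key must have the same dtype")


# Mismatched dtypes (the FP8 scenario) fail the check:
# check_qkv_dtypes("fp8_e4m3", "float16", "float16")  # raises RuntimeError

# Uniform dtypes pass:
check_qkv_dtypes("float16", "float16", "float16")
```

Under this reading, the short-sequence fix in this PR is orthogonal to FP8 support: resolving the dtype mismatch would require either casting the query back to the KV-cache dtype or an FP8-capable attention kernel.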
@sighingnow can you please review this PR, since you are the most familiar with DCA?
Hello, @sighingnow! This is a small fix. Could you take a look when you have time?
It's a simple fix and the original author is not responding. Could it be merged?
ref #21364

I'm using this fixed version on an RTX 5090 with CUDA 12.8, but token generation is very slow, about 12 tokens/s, even slower than the same model running on an RTX 4090. Is something wrong? Everything else works fine, but generation speed is extremely low.
Command: VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN CUDA_VISIBLE_DEVICES=1 vllm serve /data1/fanwei/models/qwen2.5-14b-instruct-1m-gptq-int4
Env info: Python 3.12.3
Please don't comment in this outdated PR; go to the main one. vLLM flash-attn needs a patch to work on the RTX 5090. Did you apply the sm120 patch cited in the original PR?
This issue has already been fixed on main by #21364, and I can confirm that the original case works on main. Closing.
Env Info
Problem Description
DualChunkFlashAttention fails to handle short prompts correctly. The issue can be reproduced by modifying qwen_1m.py to use a short prompt: when handling the simple prompt Hello, world!, an assertion error is raised during execution.
Proposed Fix
This issue was introduced in #11844. Since key_states and value_states are directly retrieved from the KV cache through the block_table, setting block_table is both wrong and unnecessary.
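A minimal sketch of the shape of the fix follows. The helper below is hypothetical (not vLLM's actual code): it illustrates the idea that when key/value states have already been gathered from the KV cache via the block table, the attention call must not also receive a block_table argument, or the kernel would resolve cache blocks a second time.

```python
def build_attn_kwargs(kv_already_gathered: bool, block_table):
    """Decide whether to forward block_table to the attention kernel.

    Hypothetical helper illustrating the fix: if key_states/value_states
    were already materialized from the KV cache using block_table, passing
    block_table again is wrong and unnecessary; only when the kernel itself
    must gather from paged KV should the table be forwarded.
    """
    kwargs = {}
    if not kv_already_gathered:
        # Only in this path does the kernel resolve cache blocks itself.
        kwargs["block_table"] = block_table
    return kwargs


# KV already gathered -> no block_table forwarded:
# build_attn_kwargs(True, [[0, 1]]) == {}
# Kernel gathers from paged KV -> forward the table:
# build_attn_kwargs(False, [[0, 1]]) == {"block_table": [[0, 1]]}
```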