[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache #5152
Comments
The same problem happens to me. Is a fix for this bug in progress?
@DeJoker do you also see it in the unit tests or elsewhere? How are you running it?
This issue in the spec decoding tests should already be fixed.
@khluu I don't have a demo right now that can reproduce the problem. My environment setup:

Error message:
I get the same error. When I set max_num_seqs=20, the error appears; when I set max_num_seqs=18, everything works fine. It seems like some kind of memory overflow? BTW, my GPU is an H20, and the same code runs fine on my H800 machine.
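For reference, here is a minimal sketch of how max_num_seqs can be pinned when constructing the engine offline. This only illustrates the knob that made the difference above, not a fix; the model, prompt, and sampling settings are placeholders, not taken from this issue.

```python
from vllm import LLM, SamplingParams

# max_num_seqs caps how many sequences get batched per scheduler step.
# In the report above, 20 crashed while 18 ran fine.
llm = LLM(
    model="JackFram/llama-160m",  # small placeholder model
    max_num_seqs=18,
    enforce_eager=True,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```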
The root cause of the spec decode failure is that the block size is not passed correctly.
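If the block size really is being dropped somewhere along the spec-decode path, one user-side sanity check is to set it explicitly on the engine. This is only an illustration of the knob and an assumed mechanism, not a confirmed fix; the model name is a placeholder.

```python
from vllm import LLM

# block_size is the number of tokens per paged-attention KV-cache block.
# If the attention backend is handed a different value than the cache was
# allocated with, block-table indices can end up pointing outside the cache,
# which would surface as an illegal memory access in the kernel.
llm = LLM(
    model="JackFram/llama-160m",
    block_size=16,        # vLLM's default, stated explicitly here
    enforce_eager=True,
)
```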
My environment setup:

1st environment: running on EC2 g6.4xlarge
2nd environment: running on GCP g2-standard-12

docker build --build-arg max_jobs=16 --tag vllm --target test .
docker run -it --rm --gpus all vllm bash -c "cd /vllm-workspace/tests && pytest -v -s spec_decode"
🐛 Describe the bug
Nothing changed in the tests or the relevant code. The only difference is that it is running on a different machine/environment compared to vLLM CI. I listed the 2 environments I tried above, and both failed.
The error shows up when running this test in tests/spec_decode/e2e/test_multistep_correctness.py:

Test name: test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs_diff_output_len[1-32-256-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs1-common_llm_kwargs0]
kwargs={'enforce_eager': True, 'use_v2_block_manager': True, 'model': 'JackFram/llama-160m', 'speculative_model': 'JackFram/llama-68m', 'num_speculative_tokens': 5}
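For anyone trying to reproduce this outside pytest, here is a rough sketch of the same configuration through the offline API. It assumes a vLLM version where these engine arguments are still accepted directly; the prompts and sampling settings are invented to roughly match the "large bs" / 256-output-token test parameters.

```python
from vllm import LLM, SamplingParams

# Mirrors the failing test's kwargs: the draft model JackFram/llama-68m
# proposes 5 speculative tokens per step for the target JackFram/llama-160m.
llm = LLM(
    model="JackFram/llama-160m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    enforce_eager=True,
)

# A batch of 32 prompts, greedy sampling, long outputs.
prompts = ["The quick brown fox jumps over the lazy dog"] * 32
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))
for out in outputs[:2]:
    print(out.outputs[0].text)
```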
The failure message and stack trace start here: https://buildkite.com/vllm/ci-aws/builds/82#018fcb54-3ae6-4a96-8e2a-67c66814003d/184-356
The error happens when flash_attn_cuda.fwd_kvcache is called in /attention/backends/flash_attn.py.
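As far as I understand, flash_attn_cuda.fwd_kvcache is the CUDA extension entry point behind flash-attn's Python-level flash_attn_with_kvcache. The sketch below (toy shapes and invented values, needs a CUDA GPU with flash-attn installed) just shows the paged-KV calling convention and where a wrong block size could bite: if the block table is built assuming a different block size than the cache was allocated with, its entries can run past num_blocks and the kernel reads out of bounds, which surfaces exactly as a CUDA illegal memory access.

```python
import torch
from flash_attn import flash_attn_with_kvcache

# Toy shapes purely for illustration.
batch, heads, head_dim = 2, 4, 64
block_size, num_blocks = 256, 8   # cache allocated as 8 blocks of 256 tokens

q = torch.randn(batch, 1, heads, head_dim, dtype=torch.float16, device="cuda")
k_cache = torch.randn(num_blocks, block_size, heads, head_dim,
                      dtype=torch.float16, device="cuda")
v_cache = torch.randn_like(k_cache)

# Each row maps a sequence's logical blocks to physical cache blocks.
# Every entry must stay < num_blocks; a table computed with the wrong
# block size can exceed that bound and trigger the illegal access.
block_table = torch.tensor([[0, 1], [2, 3]], dtype=torch.int32, device="cuda")
cache_seqlens = torch.tensor([300, 400], dtype=torch.int32, device="cuda")

out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # (batch, 1, heads, head_dim)
```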
Running the test with VLLM_ATTENTION_BACKEND=XFORMERS passes. Could this bug be related to flash attention?
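For completeness, one way to flip the attention backend: the variable has to be visible before the engine picks its backend, so either prefix the pytest command in the shell or set it early in a Python script. The model below is just a placeholder.

```python
import os

# Set before importing vllm so backend selection definitely sees it.
# Shell equivalent: VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s spec_decode
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(model="JackFram/llama-160m", enforce_eager=True)
```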