
[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache #5152

Closed
khluu opened this issue May 31, 2024 · 6 comments · Fixed by #8975
Labels
bug Something isn't working

Comments

khluu (Collaborator) commented May 31, 2024

My environment setup

1st environment (running on ec2 g6.4xlarge)

[2024-06-01T10:14:23Z] Collecting environment information...
[2024-06-01T10:14:26Z] PyTorch version: 2.3.0+cu121
[2024-06-01T10:14:26Z] Is debug build: False
[2024-06-01T10:14:26Z] CUDA used to build PyTorch: 12.1
[2024-06-01T10:14:26Z] ROCM used to build PyTorch: N/A
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] OS: Ubuntu 22.04.4 LTS (x86_64)
[2024-06-01T10:14:26Z] GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[2024-06-01T10:14:26Z] Clang version: Could not collect
[2024-06-01T10:14:26Z] CMake version: version 3.29.3
[2024-06-01T10:14:26Z] Libc version: glibc-2.35
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
[2024-06-01T10:14:26Z] Python platform: Linux-6.1.90-99.173.amzn2023.x86_64-x86_64-with-glibc2.35
[2024-06-01T10:14:26Z] Is CUDA available: True
[2024-06-01T10:14:26Z] CUDA runtime version: Could not collect
[2024-06-01T10:14:26Z] CUDA_MODULE_LOADING set to: LAZY
[2024-06-01T10:14:26Z] GPU models and configuration: GPU 0: NVIDIA L4
[2024-06-01T10:14:26Z] Nvidia driver version: 525.147.05
[2024-06-01T10:14:26Z] cuDNN version: Could not collect
[2024-06-01T10:14:26Z] HIP runtime version: N/A
[2024-06-01T10:14:26Z] MIOpen runtime version: N/A
[2024-06-01T10:14:26Z] Is XNNPACK available: True
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] CPU:
[2024-06-01T10:14:26Z] Architecture:                         x86_64
[2024-06-01T10:14:26Z] CPU op-mode(s):                       32-bit, 64-bit
[2024-06-01T10:14:26Z] Address sizes:                        48 bits physical, 48 bits virtual
[2024-06-01T10:14:26Z] Byte Order:                           Little Endian
[2024-06-01T10:14:26Z] CPU(s):                               16
[2024-06-01T10:14:26Z] On-line CPU(s) list:                  0-15
[2024-06-01T10:14:26Z] Vendor ID:                            AuthenticAMD
[2024-06-01T10:14:26Z] Model name:                           AMD EPYC 7R13 Processor
[2024-06-01T10:14:26Z] CPU family:                           25
[2024-06-01T10:14:26Z] Model:                                1
[2024-06-01T10:14:26Z] Thread(s) per core:                   2
[2024-06-01T10:14:26Z] Core(s) per socket:                   8
[2024-06-01T10:14:26Z] Socket(s):                            1
[2024-06-01T10:14:26Z] Stepping:                             1
[2024-06-01T10:14:26Z] BogoMIPS:                             5299.99
[2024-06-01T10:14:26Z] Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
[2024-06-01T10:14:26Z] Hypervisor vendor:                    KVM
[2024-06-01T10:14:26Z] Virtualization type:                  full
[2024-06-01T10:14:26Z] L1d cache:                            256 KiB (8 instances)
[2024-06-01T10:14:26Z] L1i cache:                            256 KiB (8 instances)
[2024-06-01T10:14:26Z] L2 cache:                             4 MiB (8 instances)
[2024-06-01T10:14:26Z] L3 cache:                             32 MiB (1 instance)
[2024-06-01T10:14:26Z] NUMA node(s):                         1
[2024-06-01T10:14:26Z] NUMA node0 CPU(s):                    0-15
[2024-06-01T10:14:26Z] Vulnerability Gather data sampling:   Not affected
[2024-06-01T10:14:26Z] Vulnerability Itlb multihit:          Not affected
[2024-06-01T10:14:26Z] Vulnerability L1tf:                   Not affected
[2024-06-01T10:14:26Z] Vulnerability Mds:                    Not affected
[2024-06-01T10:14:26Z] Vulnerability Meltdown:               Not affected
[2024-06-01T10:14:26Z] Vulnerability Mmio stale data:        Not affected
[2024-06-01T10:14:26Z] Vulnerability Reg file data sampling: Not affected
[2024-06-01T10:14:26Z] Vulnerability Retbleed:               Not affected
[2024-06-01T10:14:26Z] Vulnerability Spec rstack overflow:   Mitigation; safe RET, no microcode
[2024-06-01T10:14:26Z] Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
[2024-06-01T10:14:26Z] Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
[2024-06-01T10:14:26Z] Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
[2024-06-01T10:14:26Z] Vulnerability Srbds:                  Not affected
[2024-06-01T10:14:26Z] Vulnerability Tsx async abort:        Not affected
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] Versions of relevant libraries:
[2024-06-01T10:14:26Z] [pip3] mypy==1.9.0
[2024-06-01T10:14:26Z] [pip3] mypy-extensions==1.0.0
[2024-06-01T10:14:26Z] [pip3] numpy==1.26.4
[2024-06-01T10:14:26Z] [pip3] nvidia-nccl-cu12==2.20.5
[2024-06-01T10:14:26Z] [pip3] torch==2.3.0
[2024-06-01T10:14:26Z] [pip3] triton==2.3.0
[2024-06-01T10:14:26Z] [conda] Could not collect
[2024-06-01T10:14:26Z] ROCM Version: Could not collect
[2024-06-01T10:14:26Z] Neuron SDK Version: N/A
[2024-06-01T10:14:26Z] vLLM Version: 0.4.3
[2024-06-01T10:14:26Z] vLLM Build Flags:
[2024-06-01T10:14:26Z] CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
[2024-06-01T10:14:26Z] GPU Topology:
[2024-06-01T10:14:26Z] GPU0	CPU Affinity	NUMA Affinity
[2024-06-01T10:14:26Z] GPU0	 X 	0-15		N/A

2nd environment (running on GCP g2-standard-12):

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-29-cloud-amd64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             4400.45
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            192 KiB (6 instances)
L1i cache:                            192 KiB (6 instances)
L2 cache:                             6 MiB (6 instances)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-11
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-11    0               N/A
  • How to repro:
    • Build the vLLM Docker image from a freshly cloned vllm repo: docker build --build-arg max_jobs=16 --tag vllm --target test .
    • Run the tests in the Docker container: docker run -it --rm --gpus all vllm bash -c "cd /vllm-workspace/tests && pytest -v -s spec_decode" (a command narrowed to the single failing test is sketched right after this list)
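To narrow the run to just the failing test, the same container command can target the test id reported in the bug description below (the node id is quoted because of the brackets in the parametrized name):

    docker run -it --rm --gpus all vllm bash -c "cd /vllm-workspace/tests && pytest -v -s 'spec_decode/e2e/test_multistep_correctness.py::test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs_diff_output_len[1-32-256-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs1-common_llm_kwargs0]'"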

🐛 Describe the bug

  • Nothing changed in the tests or the relevant code. The only difference is that it runs on a different machine/environment than the vLLM CI. I listed the two environments I tried; both failed.

  • The error shows up when running this test in tests/spec_decode/e2e/test_multistep_correctness.py:

  • Test name is test_spec_decode_e2e_greedy_correctness_tiny_model_large_bs_diff_output_len[1-32-256-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs1-common_llm_kwargs0]

  • kwargs={'enforce_eager': True, 'use_v2_block_manager': True, 'model': 'JackFram/llama-160m', 'speculative_model': 'JackFram/llama-68m', 'num_speculative_tokens': 5} (a standalone sketch of this configuration follows this list)

  • Failure message and stack trace starts here: https://buildkite.com/vllm/ci-aws/builds/82#018fcb54-3ae6-4a96-8e2a-67c66814003d/184-356

  • The error happens when flash_attn_cuda.fwd_kvcache is called in vllm/attention/backends/flash_attn.py

  • Running the test with VLLM_ATTENTION_BACKEND=XFORMERS passes, so the bug appears to be specific to the flash attention backend.
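For reference, a minimal standalone sketch of the failing configuration, assembled from the kwargs listed above (vLLM 0.4.3 API; the prompt and sampling settings are illustrative placeholders, not taken from the test):

    # Sketch of the failing spec-decode setup, using the kwargs listed above.
    # The prompt and sampling settings below are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="JackFram/llama-160m",
        speculative_model="JackFram/llama-68m",
        num_speculative_tokens=5,
        use_v2_block_manager=True,
        enforce_eager=True,
    )
    outputs = llm.generate(["The quick brown fox"],
                           SamplingParams(temperature=0.0, max_tokens=256))

Exporting VLLM_ATTENTION_BACKEND=XFORMERS before constructing the LLM selects the xFormers backend, which is the passing configuration mentioned above.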

khluu added the bug (Something isn't working) label on May 31, 2024
khluu changed the title from "[Bug]: CUDA illegal memory access when using flash attn" to "[Bug]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache" on May 31, 2024
cadedaniel changed the title from "[Bug]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache" to "[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache" on Jun 5, 2024
DeJoker commented Jun 18, 2024

The same problem happens to me. Is a fix for this bug in progress?

khluu (Collaborator, Author) commented Jun 18, 2024

@DeJoker do you also see it in a unit test or somewhere else? How are you running it?

khluu (Collaborator, Author) commented Jun 18, 2024

The issue in the spec decoding tests should already be fixed.

DeJoker commented Jun 18, 2024

@khluu I don't have a demo right now that reproduces the problem; it's just the same failure in flash_attn_cuda.fwd_kvcache.
The setup: vLLM runs inside NVIDIA Triton Server (nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3), and requests are sent directly with a gRPC client.

My environment setup:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.134-13.1.al8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 530.30.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-63
NUMA node1 CPU(s):               64-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

The error message:

INFO 06-18 08:04:36 metrics.py:341] Avg prompt throughput: 17673.7 tokens/s, Avg generation throughput: 204.0 tokens/s, Running: 233 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.8%, CPU KV cache usage: 0.0%.
INFO 06-18 08:04:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 190 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.4%, CPU KV cache usage: 0.0%.
ERROR 06-18 08:04:44 async_llm_engine.py:52] Engine background task failed
ERROR 06-18 08:04:44 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
ERROR 06-18 08:04:44 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return fut.result()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
ERROR 06-18 08:04:44 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-18 08:04:44 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = model_executable(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states, residual = layer(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = self.self_attn(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 355, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output[num_prefill_tokens:] = flash_attn_with_kvcache(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1233, in flash_attn_with_kvcache
ERROR 06-18 08:04:44 async_llm_engine.py:52]     out, softmax_lse = flash_attn_cuda.fwd_kvcache(
ERROR 06-18 08:04:44 async_llm_engine.py:52] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 06-18 08:04:44 async_llm_engine.py:52] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-18 08:04:44 async_llm_engine.py:52] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 06-18 08:04:44 async_llm_engine.py:52] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-18 08:04:44 async_llm_engine.py:52] 
Exception in callback _log_task_completion(error_callback=<bound method...7eff2e47e500>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32
handle: <Handle _log_task_completion(error_callback=<bound method...7eff2e47e500>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 355, in forward
    output[num_prefill_tokens:] = flash_attn_with_kvcache(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1233, in flash_attn_with_kvcache
    out, softmax_lse = flash_attn_cuda.fwd_kvcache(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
I0618 08:04:44.709818 1084 model.py:368] "[vllm] Error generating stream: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"
I0618 08:04:44.710252 1084 model.py:368] "[vllm] Error generating stream: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"

rain7996 commented:

I get the same error. With max_num_seqs=20 the error appears; with max_num_seqs=18 everything works fine. It seems like some kind of memory overflow? BTW, my GPU is an H20, and the same code runs well on my H800 machine.
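If the failure really tracks batch pressure like this, capping the number of concurrently running sequences is a plausible stopgap while the root cause is investigated. A minimal sketch (the model name is a placeholder):

    # Workaround sketch based on the observation above: cap concurrent
    # sequences at the level observed to work. Model name is a placeholder.
    from vllm import LLM

    llm = LLM(model="<your-model>", max_num_seqs=18)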

LiuXiaoxuanPKU (Collaborator) commented:

The root cause of the spec decode failure is that the block size is not passed correctly.
As shown here, the test does not specify a block size when creating the batch, so the create_batch function falls back to a default block size that differs from the block size defined earlier. I will fix it so that we can use the flash attention backend for the spec decode CI tests as well. @khluu
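To make the failure mode concrete, here is a toy illustration (all numbers are assumed, not taken from the test) of how a block-size mismatch between the KV-cache allocation and the batch metadata produces out-of-range cache accesses:

    # Illustration only; block sizes are assumed. The KV cache is allocated
    # with one block size, but create_batch builds metadata with another.
    cache_block_size = 16   # block size the KV cache was allocated with (assumed)
    meta_block_size = 32    # default block size used by create_batch (assumed)

    token_position = 60
    # The metadata maps the token to a (block, offset) pair with its own size:
    block_idx = token_position // meta_block_size   # block 1
    offset = token_position % meta_block_size       # offset 28

    # The kernel reads block_base + offset, but each physical block only spans
    # cache_block_size = 16 slots, so offset 28 lands past the block's
    # allocation -- the illegal memory access that fwd_kvcache reports.
    assert offset >= cache_block_size  # out of bounds in this example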
