From bbbc1d98dd6ae956abf167b276be0add06c1fa76 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Wed, 22 Oct 2025 20:50:56 +0800
Subject: [PATCH 1/3] add tldr

Signed-off-by: youkaichao
---
 _posts/2025-08-11-cuda-debugging.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/_posts/2025-08-11-cuda-debugging.md b/_posts/2025-08-11-cuda-debugging.md
index a45e9a3..7c7798e 100644
--- a/_posts/2025-08-11-cuda-debugging.md
+++ b/_posts/2025-08-11-cuda-debugging.md
@@ -5,6 +5,10 @@ author: "Kaichao You"
 image: /assets/logos/vllm-logo-text-light.png
 ---
 
+TL;DR: If you hit `an illegal memory access was encountered` error, you can enable CUDA core dump to debug the issue. Simply set the following environment variables and run your program again to collect the coredump file, then you can use `cuda-gdb` to debug the issue.
+
+`CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_SHOW_PROGRESS=1 CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t"`
+
 # Introduction
 
 Have you ever felt you are developing cuda kernels and your tests often run into illegal memory access (IMA for short) and you have no idea how to debug? We definitely felt this pain again and again while working on vLLM, a high-performance inference engine for LLM models.
From 6a67c3f3a9a895a616dfc77cc61ecbe1a4ada36f Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Wed, 22 Oct 2025 20:53:18 +0800
Subject: [PATCH 2/3] add tldr

Signed-off-by: youkaichao
---
 _posts/2025-08-11-cuda-debugging.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/_posts/2025-08-11-cuda-debugging.md b/_posts/2025-08-11-cuda-debugging.md
index 7c7798e..550cbdc 100644
--- a/_posts/2025-08-11-cuda-debugging.md
+++ b/_posts/2025-08-11-cuda-debugging.md
@@ -7,7 +7,9 @@ image: /assets/logos/vllm-logo-text-light.png
 
 TL;DR: If you hit `an illegal memory access was encountered` error, you can enable CUDA core dump to debug the issue. Simply set the following environment variables and run your program again to collect the coredump file, then you can use `cuda-gdb` to debug the issue.
 
-`CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_SHOW_PROGRESS=1 CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t"`
+```bash
+CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_SHOW_PROGRESS=1 CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t"
+```
 
 # Introduction
 
From 8adea5036342d1296dfa229e8f6db3b0a79af618 Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Wed, 22 Oct 2025 20:56:11 +0800
Subject: [PATCH 3/3] add tldr

Signed-off-by: youkaichao
---
 _posts/2025-08-11-cuda-debugging.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/_posts/2025-08-11-cuda-debugging.md b/_posts/2025-08-11-cuda-debugging.md
index 550cbdc..02ef393 100644
--- a/_posts/2025-08-11-cuda-debugging.md
+++ b/_posts/2025-08-11-cuda-debugging.md
@@ -8,7 +8,10 @@ image: /assets/logos/vllm-logo-text-light.png
 TL;DR: If you hit `an illegal memory access was encountered` error, you can enable CUDA core dump to debug the issue. Simply set the following environment variables and run your program again to collect the coredump file, then you can use `cuda-gdb` to debug the issue.
 
 ```bash
-CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 CUDA_COREDUMP_SHOW_PROGRESS=1 CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t"
+CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
+CUDA_COREDUMP_SHOW_PROGRESS=1 \
+CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' \
+CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t"
 ```
 
 # Introduction
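
Note on usage: the TL;DR lists the variables but not how they attach to a run. Below is a minimal sketch, assuming a hypothetical wrapper named `run_with_cuda_coredump` and a placeholder program (`my_test.py`); neither appears in the post. The variables and their values are copied from the patch, and `target cudacore` is the `cuda-gdb` command for opening a GPU core dump.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the post): run a command with the
# core-dump variables from the TL;DR set for that command only, so the
# calling shell's environment is left untouched.
run_with_cuda_coredump() {
  env \
    CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
    CUDA_COREDUMP_SHOW_PROGRESS=1 \
    CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory' \
    CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h.%p.%t" \
    "$@"
}

# Usage (placeholder program name):
#   run_with_cuda_coredump python my_test.py
#
# After the run hits the exception, open the file written under /tmp
# (named after the %h.%p.%t template above) in the debugger:
#   cuda-gdb
#   (cuda-gdb) target cudacore /tmp/cuda_coredump_<host>.<pid>.<time>
```

Passing the variables through `env` keeps them scoped to the one command, which avoids producing core dumps from unrelated CUDA programs started later in the same shell.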