
CUDA an illegal memory access was encountered #1502

Open
piotr-sikora-v opened this issue Nov 16, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@piotr-sikora-v

When I try to start any large model, I get this error:

CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered

My GPU is a 1080 Ti (11 GB VRAM). The base.en model works.

openai-whisper with large-v2 works without problems on the GPU.

I also tried quantized models: q5_0 has the same problem, and q4_0 starts but produces no output :/
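For context on what q5_0 does: ggml stores each block of 32 weights as one scale plus 5-bit integers. Below is a simplified NumPy sketch of that round trip — illustrative only, not ggml's actual implementation (the real Q5_0 packs low nibbles plus a separate high-bit plane and stores the scale as fp16):

```python
import numpy as np

QK = 32  # values per quantization block, as in ggml's Q5_0

def quantize_q5_0(x):
    """Simplified Q5_0-style block quantization: per 32-value block,
    one scale d and 5-bit quants in [0, 31]."""
    x = x.reshape(-1, QK)
    # signed max: the element with the largest magnitude, keeping its sign
    amax_idx = np.argmax(np.abs(x), axis=1)
    smax = x[np.arange(len(x)), amax_idx]
    d = smax / -16.0
    d[d == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / d[:, None]) + 16, 0, 31).astype(np.uint8)
    return d, q

def dequantize_q5_0(d, q):
    return (q.astype(np.float32) - 16) * d[:, None]

x = np.random.default_rng(0).standard_normal(64).astype(np.float32)
d, q = quantize_q5_0(x)
y = dequantize_q5_0(d, q).reshape(-1)
print("max abs round-trip error:", np.abs(x - y).max())
```

The round-trip error is bounded by roughly one scale step per block, which is why q5_0 normally degrades quality only slightly — gibberish or crashes point at the backend, not the format.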

In dmesg:

[12337.297750] NVRM: Xid (PCI:0000:04:00): 31, pid=167607, name=main, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7ffb_7fe03000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

example run:

# ./main -debug -m models/ggml-large-v3.bin -l pl   janina.wav 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'janina.wav' (50387302 samples, 3149.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pl, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0
@kornpow

kornpow commented Nov 17, 2023

Try using one of the smaller models.

I had a similar problem where I got garbage output when using the GPU while it worked flawlessly on the CPU, but never mind — that got solved once I installed CUDA properly.

By the way, I am running a GTX 1080.

@ggerganov
Owner

Are you using the latest version of this repo?

@piotr-sikora-v
Author

Are you using the latest version of this repo?

Yes, I updated the repo a few minutes ago, ran make clean, and did a fresh build... same issue :/

@bobqianic added the bug label Nov 18, 2023
@SmallAndSoft

I am having exactly the same issue with a GTX 1060 while using ggml-large-v3-q5_0.bin.

@ggerganov
Owner

I would need a sample audio file and the exact command that reproduces the issue.

@kornpow

kornpow commented Nov 22, 2023

./main -m models/ggml-large-v3.bin -f samples/jfk.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0
Wed Nov 22 10:31:40 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P0              51W / 210W |     15MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@ggerganov
Owner

If you apply the following patch, does it work?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index b420330..9da239a 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -96,7 +96,7 @@
 // -  7B quantum model: +100-200 MB
 // - 13B quantum model: +200-400 MB
 //
-//#define GGML_CUDA_FORCE_MMQ
+#define GGML_CUDA_FORCE_MMQ
 
 // TODO: improve this to be correct for more hardware
 //       for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores

@Zubairu

Zubairu commented Nov 24, 2023

I'm using ggml-large-v3.bin too, and I also have this issue with a Quadro P6000.

After applying the patch above, still no luck.

@ggerganov
Owner

Yeah, I'm sorry. Without being able to reproduce it, I won't be able to fix it. It works on my old GTX 1660, and I can't rent an older GPU to test.

@ggerganov
Owner

Try #1548, but it's unlikely that it will resolve the problem.

@Zubairu

Zubairu commented Nov 26, 2023

Try #1548, but it's unlikely that it will resolve the problem.

Thanks for your reply. I tried #1548, but I still have the same issue with large-v3, while large-v2 works fine.

@ggerganov
Owner

but still have the same issue for large-v3, while large-v2 works fine.

Hm, interesting. Could the rest of the people that have issues confirm that it is only v3 that does not work?
@SmallAndSoft @piotr-sikora-v @kornpow

@cebtenzzre
Contributor

cebtenzzre commented Nov 28, 2023

I am hitting this with large-v1 in f16 and q5_1:

CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered

I do not hit this with q8_0, but I get gibberish. This model works fine on 4774d2f (with -arch=native for modern CUDA). Will bisect.

edit:
ec7a6f0 works fine on my P40.
b050283 (#1472) either hangs (apparently doing nothing, with 100% GPU usage) or crashes with the above assertion failure on line 8224.
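For anyone else narrowing this down, `git bisect run` automates the search. The throwaway-repo script below demonstrates the workflow; for whisper.cpp the run script would instead build ./main, run it on a sample, and exit non-zero when the CUDA error or gibberish appears (the file name `broken` here is just a stand-in for a failing test):

```shell
#!/bin/sh
# Demonstrate `git bisect run` in a throwaway repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email bisect@example.com
git config user.name bisect
# Ten commits; the "regression" (a file named broken) lands in commit 7.
for i in $(seq 1 10); do
    echo "$i" > version.txt
    [ "$i" -ge 7 ] && touch broken
    git add -A
    git commit -qm "commit $i"
done
# Mark HEAD bad and the first commit good, then let bisect drive the test:
# exit 0 means good, non-zero means bad.
git bisect start HEAD "$(git rev-parse HEAD~9)"
git bisect run sh -c '! test -f broken' > /dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit message: $first_bad"
git bisect reset
```

With a build-and-run test script in place of `sh -c '! test -f broken'`, this pinpoints the first bad commit in O(log n) builds.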

@ggerganov
Owner

Before b050283, the GPU just ran a naive matrix multiplication with host-device-host copies. That was the first commit to introduce llama.cpp-like GPU offloading of entire graphs.

But it seems that we have some bug in ggml-cuda.cu, and it looks like it triggers on older hardware. This is a bit worrying.

@nplanel

nplanel commented Nov 30, 2023

On a P40 at b050283, even the text output with the medium model is garbage. Something is broken in ggml-cuda.cu.

@syedia1

syedia1 commented Dec 5, 2023

Testing on Kaggle with an NVIDIA P100.

Latest commit as of today: f0efd02.

The issue seems to affect all models (tested large-v3, large-v2, medium, base, tiny), in different ways: the large models give the CUDA error 700, and the smaller models give gibberish output.

I tested on Kaggle so that others without access to older-generation GPUs can also reproduce the issue.

For model - large-v3

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

Same result for large-v2

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.50 MB
whisper_model_load: model size    = 3117.02 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   30.92 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

The error is not present with the medium, base and tiny models, but the output is gibberish in all cases.

Medium model:

whisper_init_from_file_with_params_no_state: loading model from '/kaggle/working/whisper.cpp/models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  1551.93 MB
whisper_model_load: model size    = 1551.57 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  132.12 MB
whisper_init_state: kv cross size =  147.46 MB
whisper_init_state: compute buffer (conv)   =   25.54 MB
whisper_init_state: compute buffer (encode) =  170.21 MB
whisper_init_state: compute buffer (cross)  =    7.78 MB
whisper_init_state: compute buffer (decode) =   98.25 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:30.000]  ,,,,,,,,,,.,,,,.,,,,,,,,,,,.,,.,,Inaudible,,,,,,,,,...,,,,,,,,,,,,,,,,, now,,,,,,.,.,,,,,,,,,..,,,,,,,.,,,,,.ING,,,,,,,....,,,,,.,,,,,,,,,,,,,,,,,,,,..,,,,,,,,,,,,,,,.,,,,,,,,,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:,,,.,,,


whisper_print_timings:     load time =  1267.49 ms
whisper_print_timings:     fallbacks =   5 p /   2 h
whisper_print_timings:      mel time =    28.92 ms
whisper_print_timings:   sample time =  6398.09 ms /  6248 runs (    1.02 ms per run)
whisper_print_timings:   encode time =   234.05 ms /     1 runs (  234.05 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time = 22076.14 ms /  6236 runs (    3.54 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 30023.55 ms

Base model:

[00:00:00.000 --> 00:00:30.000]   I I T I I B I I To I I

Tiny model:

[00:00:00.500 --> 00:00:30.000]   ' " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "

@ggerganov
Owner

From the reports so far, this appears to happen on devices with compute capability 6.0 or 6.1. Does anyone experience this issue with a higher CC?

@ggerganov
Owner

If you apply the following patch, does it fix the issue?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index e80b7a7..caafbd5 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -7522,7 +7522,7 @@ static void ggml_cuda_mul_mat_mat_batched_cublas(const ggml_tensor * src0, const
     const half alpha_f16 = 1.0f;
     const half beta_f16  = 0.0f;
 
-#if 0
+#if 1
     // use cublasGemmEx
     {
         for (int i13 = 0; i13 < ne13; ++i13) {

@cebtenzzre
Contributor

If you apply the following patch, does it fix the issue?

Nope, still getting CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered.

@Alvarocda

Alvarocda commented Dec 11, 2023

I'm having the same problem.

In versions 1.4.0 and 1.4.3 it works normally, but in version 1.5 it gives this error.

This error is happening on a PC with a GTX 1060.

On another computer, with an NVIDIA T1000, Whisper runs without errors in version 1.5.

@ggerganov
Owner

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

@SmallAndSoft

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

That fixed the issue for my GTX 1060.
Thanks!


10 participants