
CUDA an illegal memory access was encountered #1502

Open
piotr-sikora-v opened this issue Nov 16, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@piotr-sikora-v

When I try to start any large model, I get this error:

CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered

My GPU is a 1080 Ti (11 GB VRAM). The base.en model works.

openai-whisper with large-v2 works without problems on the GPU.

I also tried quantized models: q5_0 has the same problem, and q4_0 starts but produces no output :/
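For context on what q5_0 does: ggml stores each block of 32 weights as one scale plus 5-bit integers. Below is a simplified NumPy sketch of that round trip — illustrative only, not ggml's actual implementation (the real Q5_0 packs low nibbles plus a separate high-bit plane and stores the scale as fp16):

```python
import numpy as np

QK = 32  # values per quantization block, as in ggml's Q5_0

def quantize_q5_0(x):
    """Simplified Q5_0-style block quantization: per 32-value block,
    one scale d and 5-bit quants in [0, 31]."""
    x = x.reshape(-1, QK)
    # signed max: the element with the largest magnitude, keeping its sign
    amax_idx = np.argmax(np.abs(x), axis=1)
    smax = x[np.arange(len(x)), amax_idx]
    d = smax / -16.0
    d[d == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(x / d[:, None]) + 16, 0, 31).astype(np.uint8)
    return d, q

def dequantize_q5_0(d, q):
    return (q.astype(np.float32) - 16) * d[:, None]

x = np.random.default_rng(0).standard_normal(64).astype(np.float32)
d, q = quantize_q5_0(x)
y = dequantize_q5_0(d, q).reshape(-1)
print("max abs round-trip error:", np.abs(x - y).max())
```

The round-trip error is bounded by roughly one scale step per block, which is why q5_0 normally degrades quality only slightly — gibberish or crashes point at the backend, not the format.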

In dmesg:

[12337.297750] NVRM: Xid (PCI:0000:04:00): 31, pid=167607, name=main, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7ffb_7fe03000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

example run:

# ./main -debug -m models/ggml-large-v3.bin -l pl   janina.wav 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'janina.wav' (50387302 samples, 3149.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pl, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0
@kornpow

kornpow commented Nov 17, 2023

Try using one of the smaller models.

I had a similar problem where I got garbage output when using the GPU while it worked flawlessly on the CPU, but never mind — that got solved once I installed CUDA properly.

By the way, I am running a GTX 1080.

@ggerganov
Owner

Are you using the latest version of this repo?

@piotr-sikora-v
Author

Are you using the latest version of this repo?

Yes, I updated the repo a few minutes ago, ran make clean, and did a fresh build... same issue :/

@bobqianic added the bug label Nov 18, 2023
@SmallAndSoft

I am having exactly the same issue with a GTX 1060 while using ggml-large-v3-q5_0.bin.

@ggerganov
Owner

I would need a sample audio file and the exact command that reproduces the issue.

@kornpow

kornpow commented Nov 22, 2023

./main -m models/ggml-large-v3.bin -f samples/jfk.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0
Wed Nov 22 10:31:40 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P0              51W / 210W |     15MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@ggerganov
Owner

If you apply the following patch, does it work?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index b420330..9da239a 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -96,7 +96,7 @@
 // -  7B quantum model: +100-200 MB
 // - 13B quantum model: +200-400 MB
 //
-//#define GGML_CUDA_FORCE_MMQ
+#define GGML_CUDA_FORCE_MMQ
 
 // TODO: improve this to be correct for more hardware
 //       for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores

@Zubairu

Zubairu commented Nov 24, 2023

I'm using ggml-large-v3.bin too, and I also have this issue with a Quadro P6000.

After applying the patch above, still no luck.

@ggerganov
Owner

Yeah, I'm sorry. Without being able to reproduce it, I won't be able to fix it. It works on my old GTX 1660, and I can't rent an older GPU to test.

@ggerganov
Owner

Try #1548, but it's unlikely that it will resolve the problem.

@Zubairu

Zubairu commented Nov 26, 2023

Try #1548, but it's unlikely that it will resolve the problem.

Thanks for your reply. I tried #1548, but I still have the same issue with large-v3, while large-v2 works fine.

@ggerganov
Owner

but still have the same issue for large-v3, while large-v2 works fine.

Hm, interesting. Could the rest of the people that have issues confirm that it is only v3 that does not work?
@SmallAndSoft @piotr-sikora-v @kornpow

@cebtenzzre
Contributor

cebtenzzre commented Nov 28, 2023

I am hitting this with large-v1 in f16 and q5_1:

CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered

I do not hit this with q8_0, but I get gibberish. This model works fine on 4774d2f (with -arch=native for modern CUDA). Will bisect.

edit:
ec7a6f0 works fine on my P40.
b050283 (#1472) either hangs (apparently doing nothing, with 100% GPU usage) or crashes with the above assertion failure on line 8224.
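For anyone else narrowing this down, `git bisect run` automates the search. The throwaway-repo script below demonstrates the workflow; for whisper.cpp the run script would instead build ./main, run it on a sample, and exit non-zero when the CUDA error or gibberish appears (the file name `broken` here is just a stand-in for a failing test):

```shell
#!/bin/sh
# Demonstrate `git bisect run` in a throwaway repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email bisect@example.com
git config user.name bisect
# Ten commits; the "regression" (a file named broken) lands in commit 7.
for i in $(seq 1 10); do
    echo "$i" > version.txt
    [ "$i" -ge 7 ] && touch broken
    git add -A
    git commit -qm "commit $i"
done
# Mark HEAD bad and the first commit good, then let bisect drive the test:
# exit 0 means good, non-zero means bad.
git bisect start HEAD "$(git rev-parse HEAD~9)"
git bisect run sh -c '! test -f broken' > /dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit message: $first_bad"
git bisect reset
```

With a build-and-run test script in place of `sh -c '! test -f broken'`, this pinpoints the first bad commit in O(log n) builds.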

@ggerganov
Owner

Before b050283, the GPU just ran a naive matrix multiplication with host-device-host copies. That was the first commit to introduce llama.cpp-like GPU offloading of entire graphs.

But it seems that we have some bug in ggml-cuda.cu, and it looks like it triggers on older hardware. This is a bit worrying.

@nplanel

nplanel commented Nov 30, 2023

On a P40 at b050283, even the text output with the medium model is garbage. Something is broken in ggml-cuda.cu.

@syedia1

syedia1 commented Dec 5, 2023

Testing on Kaggle with an NVIDIA P100.

Latest commit as of today: f0efd02.

The issue seems to affect all models (tested large-v3, large-v2, medium, base, tiny), in different ways: the large models give the CUDA error 700, and the smaller models give gibberish output.

I tested on Kaggle so that others without access to older-generation GPUs can also reproduce the issue.

For model - large-v3

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

Same result for large-v2

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.50 MB
whisper_model_load: model size    = 3117.02 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   30.92 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

The error is not present with the medium, base and tiny models, but the output is gibberish in all cases.

Medium model:

whisper_init_from_file_with_params_no_state: loading model from '/kaggle/working/whisper.cpp/models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  1551.93 MB
whisper_model_load: model size    = 1551.57 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  132.12 MB
whisper_init_state: kv cross size =  147.46 MB
whisper_init_state: compute buffer (conv)   =   25.54 MB
whisper_init_state: compute buffer (encode) =  170.21 MB
whisper_init_state: compute buffer (cross)  =    7.78 MB
whisper_init_state: compute buffer (decode) =   98.25 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:30.000]  ,,,,,,,,,,.,,,,.,,,,,,,,,,,.,,.,,Inaudible,,,,,,,,,...,,,,,,,,,,,,,,,,, now,,,,,,.,.,,,,,,,,,..,,,,,,,.,,,,,.ING,,,,,,,....,,,,,.,,,,,,,,,,,,,,,,,,,,..,,,,,,,,,,,,,,,.,,,,,,,,,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:,,,.,,,


whisper_print_timings:     load time =  1267.49 ms
whisper_print_timings:     fallbacks =   5 p /   2 h
whisper_print_timings:      mel time =    28.92 ms
whisper_print_timings:   sample time =  6398.09 ms /  6248 runs (    1.02 ms per run)
whisper_print_timings:   encode time =   234.05 ms /     1 runs (  234.05 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time = 22076.14 ms /  6236 runs (    3.54 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 30023.55 ms

Base model:

[00:00:00.000 --> 00:00:30.000]   I I T I I B I I To I I

Tiny model:

[00:00:00.500 --> 00:00:30.000]   ' " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "

@ggerganov
Owner

From the reports so far, this appears to happen on devices with compute capability 6.0 or 6.1. Does anyone experience this issue with a higher CC?

@ggerganov
Owner

If you apply the following patch, does it fix the issue?

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index e80b7a7..caafbd5 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -7522,7 +7522,7 @@ static void ggml_cuda_mul_mat_mat_batched_cublas(const ggml_tensor * src0, const
     const half alpha_f16 = 1.0f;
     const half beta_f16  = 0.0f;
 
-#if 0
+#if 1
     // use cublasGemmEx
     {
         for (int i13 = 0; i13 < ne13; ++i13) {

@cebtenzzre
Contributor

If you apply the following patch, does it fix the issue?

Nope, still getting CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered.

@Alvarocda

Alvarocda commented Dec 11, 2023

I'm having the same problem.

In versions 1.4.0 and 1.4.3 it works normally, but in version 1.5 it gives this error.

This error is happening on a PC with a GTX 1060.

On another computer, with an NVIDIA T1000, Whisper runs without errors in version 1.5.

@ggerganov
Owner

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

@SmallAndSoft

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

That fixed the issue for my GTX 1060.
Thanks!


10 participants