with the newest builds i only get gibberish output #1735

Closed
maddes8cht opened this issue Jun 7, 2023 · 81 comments
Labels: bug (Something isn't working), high priority (Very important issue)

Comments

@maddes8cht
Contributor

maddes8cht commented Jun 7, 2023

After the CUDA refactor PR #1703 by @JohannesGaessler was merged, I wanted to try it out this morning and measure the performance difference on my hardware.
I use my standard prompts with different models of different sizes.

I use the prebuilt win-cublas-cu12.1.0-x64 versions.

With the new builds I only get gibberish as a response for all prompts used and all models.
It looks like a random mix of words in different languages.

On my current PC I can only use the win-avx-x64 version; there I still get normal output.

I will be at the CUDA PC again in a few hours; then I can provide sample output or more details.
Am I the only one with this problem?

@RahulVivekNair
Contributor

Same here. It gives gibberish output only when layers are offloaded to the GPU via -ngl. Without offloading it works as it should. I had to roll back to the pre-CUDA-refactor commit.

@dranger003
Contributor

Same here, this only happens when offloading layers to the GPU; running on the CPU works fine. Also, I noticed the more GPU layers you offload, the more gibberish you get.

@JohannesGaessler
Collaborator

Thank you for reporting this issue. Just to make sure: are you getting garbage outputs with all model sizes, even with 7b?

@JohannesGaessler
Collaborator

Also to make sure: is there anyone with this issue that is compiling llama.cpp themselves or is everyone using the precompiled Windows binary?

@dranger003
Contributor

dranger003 commented Jun 7, 2023

I tried different models and model sizes, and they all produce gibberish using GPU layers but work fine using CPU. Also, I am compiling from the latest commit on the master branch, using Windows and cmake.

@JohannesGaessler
Collaborator

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.

@FlareP1

FlareP1 commented Jun 7, 2023

Yes, I agree with @dranger003 above: a local compile does not fix the issue. I also tried both the cuBLAS and CLBlast builds; both options produce gibberish. I only have one GPU. Do I need any new command-line options?

Will try the change that @JohannesGaessler suggests above.

@RahulVivekNair
Contributor

Thank you for reporting this issue. Just to make sure: are you getting garbage outputs with all model sizes, even with 7b?

I've tried it with all quantization types and model sizes. It still produces weird gibberish output.

@dranger003
Contributor

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.

Same issue on my end with this change.

diff --git a/llama.cpp b/llama.cpp
index 16d6f6e..e06d503 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -1155,7 +1155,7 @@ static void llama_model_load_internal(

         (void) vram_scratch;
 #ifdef GGML_USE_CUBLAS
-        vram_scratch = n_batch * MB;
+        vram_scratch = 4 * n_batch * MB;
         ggml_cuda_set_scratch_size(vram_scratch);
         if (n_gpu_layers > 0) {
             fprintf(stderr, "%s: allocating batch_size x 1 MB = %ld MB VRAM for the scratch buffer\n",

@FlareP1

FlareP1 commented Jun 7, 2023

In llama.cpp line 1158 there should be:

        vram_scratch = n_batch * MB;

Someone that is experiencing the issue please try to replace that line with this:

        vram_scratch = 4 * n_batch * MB;

Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux and in that case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.

This does not fix the issue for me.

main: build = 635 (5c64a09)
main: seed  = 1686151114
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080
llama.cpp: loading model from ../models/GGML/selfee-13b.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 6820.77 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 2048 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 layers to GPU
llama_model_load_internal: total VRAM used: 6587 MB
..................................................
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 5 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 1.000000, top_p = 0.730000, typical_p = 1.000000, temp = 0.730000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


�[33m ### Instruction:\
Write a detailed account on the benefits and challenges of using automated assistants based on LLMs.  Suggest how LLMs are likely to be used in the near future. What effect this will have on employment and skills needs in the workplace.  How will businesses need to adapt and evolve to maximise the benefits from this technology.\
### Response:
�[0m benefits &amp Vertigowebsitesearch engines Search engines Search google search Google searchGoogle search engineGoogle search engine Google search engine Google search engineiallyikuwaμClientele clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele Clientele ClienteleClientele Clientele ClienteleClienteleClienteleClientele Clientele ClienteleClientele clientele Clientele ClienteleClienteleClientele Clientele Clientele ClienteleClienteleClientele ClienteleClientele Clientele ClienteleClientele Clientele ClienteleClientele Clientele ClienteleClientilehiyawaifikMT machine learning Sebastianurity securitysecurity Security Security Securityintegration integration integration integration integration integration integration integration integration integration integration integration integration integration integrationintegration integration integration integration integration integration Linkexchangeabletonikonidéortodoxyfit wittersburgidé修 connaissanceable magnpackageeuropewo meshnetworkedayoutWEvikipediawikiidéangrhythmembergelesupportente Witmaternalismsavedrabblementқreb

@JohannesGaessler
Collaborator

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

@dranger003
Contributor

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

Thanks, this is the command I use to compile on my end.

cmake -DLLAMA_CUBLAS=ON . && cmake --build . --config Release

@RahulVivekNair
Contributor

Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.

Is it working as intended on Linux?

@JohannesGaessler
Collaborator

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

@JohannesGaessler added the bug (Something isn't working) and high priority (Very important issue) labels on Jun 7, 2023
@FlareP1

FlareP1 commented Jun 7, 2023

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get
CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when requesting only 10 layers of a 7B model.

However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK.

Which commit do I need to pull to rebuild from before the issue occurred?

@JohannesGaessler
Collaborator

Did you revert the change that increases the size of the VRAM scratch buffer? In any case, since the GPU changes that I did are most likely the problem the last good commit should be 44f906e8537fcec965e312d621c80556d6aa9bec.

@dranger003
Contributor

Did you revert the change that increases the size of the VRAM scratch buffer? In any case, since the GPU changes that I did are most likely the problem the last good commit should be 44f906e8537fcec965e312d621c80556d6aa9bec.

I don't have a Linux install to test at the moment, but on Windows I confirm commit 44f906e8537fcec965e312d621c80556d6aa9bec works fine with all GPU layers offloaded.

@dranger003
Contributor

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when requesting only 10 layers of a 7B model.

However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK.

Which commit do I need to pull to rebuild from before the issue occurred?

Could this be related?

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

@FlareP1

FlareP1 commented Jun 7, 2023

I have reverted the changes and checked out the 44f906e commit. On my version of WSL2 this still does not work and gives the same out-of-memory error, so I guess I probably have a WSL / CUDA setup issue.

On Windows I can compile and the code works fine from this commit.

@dranger003
Contributor

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

Working fine on WSL2 (Ubuntu) using CUDA on commit 5c64a09.
So on my end this seems to be a Windows-only bug.

@FlareP1

FlareP1 commented Jun 7, 2023

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when requesting only 10 layers of a 7B model.
However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK.
Which commit do I need to pull to rebuild from before the issue occurred?

Could this be related?

WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Awesome tip, thanks.

Due to a current CUDA bug you need to set the no-pinned environment variable. The command for it:
export GGML_CUDA_NO_PINNED=1

Now the old commit works under WSL; I will try the latest again.

UPDATE: Yes, it works fine on the latest commit under WSL2, as long as pinned memory is disabled.
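
For anyone else hitting the WSL2 out-of-memory error, a minimal sketch of the workaround looks like this (the model path and -ngl value are just placeholders taken from logs earlier in this thread):

    export GGML_CUDA_NO_PINNED=1
    ./build/bin/main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin -ngl 10 -p "hello"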

@RahulVivekNair
Contributor

Can confirm; I ran under WSL and the output is as expected. Something is wrong only on the Windows side with the gibberish output.

@JohannesGaessler
Collaborator

I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.

@FlareP1

FlareP1 commented Jun 7, 2023

I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.

If it helps I am using the following config.

Microsoft Windows [Version 10.0.19044.2965]

>nvidia-smi
Wed Jun  7 19:48:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.61                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080       WDDM | 00000000:01:00.0  On |                  N/A |
| 40%   27C    P8               16W / 320W|    936MiB / 10240MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

@dranger003
Contributor

dranger003 commented Jun 7, 2023

And here's mine.

Microsoft Windows [Version 10.0.22621.1778]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P8              32W / 420W |  12282MiB / 24576MiB |     44%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

And also this one.

Microsoft Windows [Version 10.0.22621.1702]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070 ...  WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8               4W /  80W |    265MiB /  8192MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@maddes8cht
Contributor Author

Here's mine:

Microsoft Windows [Version 10.0.19045.3031]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Mathias>nvidia-smi
Wed Jun  7 21:20:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.61                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060       WDDM | 00000000:0A:00.0  On |                  N/A |
|  0%   41C    P8               16W / 170W|    704MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Maybe it is a Windows 10 thing?
@JohannesGaessler, are you using Windows 10 or Windows 11?
Is anyone having problems on Windows 11?
Just a guess...

@dranger003
Contributor

Maybe it is a Windows 10 thing? @JohannesGaessler, are you using Windows 10 or Windows 11? Is anyone having problems on Windows 11? Just a guess...

I'm on Windows 11 and I have the issue; build 10.0.22621.1778 is Windows 11, by the way.

@JohannesGaessler
Collaborator

I'm now able to reproduce the issue. On my system it only occurs when I use the --config Release option. If I make a debug build by omitting the option the program produces correct results.
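
For reference, the two builds being compared would look roughly like this with the cmake invocation used earlier in the thread (assuming the Visual Studio generator, where leaving out --config produces a Debug build):

    cmake .. -DLLAMA_CUBLAS=ON
    cmake --build . --config Release   # this configuration reproduces the gibberish here
    cmake --build .                    # Debug build: correct output here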

@mirek190

mirek190 commented Jun 9, 2023

I also have an RTX 3090.
With an i9-9900, 64 GB RAM, and Windows 10 for testing.
cuBLAS 12 build.

With the last working build before GPU layers were broken, master-35a8491:

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 627 (d5b111f)
main: seed = 1686339737
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 3475 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

llama_print_timings: load time = 2366.95 ms
llama_print_timings: sample time = 964.46 ms / 739 runs ( 1.31 ms per token)
llama_print_timings: prompt eval time = 544.35 ms / 25 tokens ( 21.77 ms per token)
llama_print_timings: eval time = 39775.42 ms / 739 runs ( 53.82 ms per token)
llama_print_timings: total time = 73001.68 ms


Then I cloned your branch from https://github.com/JohannesGaessler/llama.cpp.git to test whether GPU layers are fixed.
I built it with cuBLAS.

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 567 (2d5db48)
main: seed = 1686340816
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
....................................................................................................
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 3544 MB
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

llama_print_timings: load time = 3935.70 ms
llama_print_timings: sample time = 866.11 ms / 688 runs ( 1.26 ms per token)
llama_print_timings: prompt eval time = 608.47 ms / 25 tokens ( 24.34 ms per token)
llama_print_timings: eval time = 31869.14 ms / 688 runs ( 46.32 ms per token)
llama_print_timings: total time = 48665.29 ms

So it looks OK to me.
Speed is the same or very similar between the older build and the new one, at least on Windows 10.


But new models are not working, q4_K_M for instance.

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 567 (2d5db48)
main: seed = 1686341516
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin
error loading model: unrecognized tensor type 12

llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin'
main: error: unable to load model

@mirek190

mirek190 commented Jun 9, 2023

Also tested the newest main-tree build.

Old models are fine with offloading to GPU or on CPU only.
New models give garbage all the time... on GPU or CPU only as well.

@JohannesGaessler
Collaborator

Did you confirm that it is specifically commit 17366df842e358768c0df7024484fffecfc7865b that is causing this?

@omasoud

omasoud commented Jun 10, 2023

I just pulled latest master and compiled; gibberish problem is gone now. Thanks @JohannesGaessler.

PS G:\projects\llm\llama.cpp> git pull
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 34 (delta 16), reused 16 (delta 9), pack-reused 6
Unpacking objects: 100% (34/34), 70.54 KiB | 1.57 MiB/s, done.
From https://github.com/ggerganov/llama.cpp
   72ff528..98ed165  master         -> origin/master
 * [new branch]      ik/q3_k_metal  -> origin/ik/q3_k_metal
 * [new tag]         master-98ed165 -> master-98ed165
Updating 72ff528..98ed165
Fast-forward
 ggml-cuda.cu     |  9 +++++++++
 ggml-metal.m     | 18 +++++++++++++++++-
 ggml-metal.metal | 45 ++++++++++++++++++++++++++++++---------------
 ggml-opencl.cpp  |  9 +++++++++
 ggml-opencl.h    |  2 ++
 llama.cpp        |  6 +++++-
 6 files changed, 72 insertions(+), 17 deletions(-)

@mirek190

I just pulled latest master and compiled; gibberish problem is gone now. Thanks @JohannesGaessler.

PS G:\projects\llm\llama.cpp> git pull
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 34 (delta 16), reused 16 (delta 9), pack-reused 6
Unpacking objects: 100% (34/34), 70.54 KiB | 1.57 MiB/s, done.
From https://github.com/ggerganov/llama.cpp
   72ff528..98ed165  master         -> origin/master
 * [new branch]      ik/q3_k_metal  -> origin/ik/q3_k_metal
 * [new tag]         master-98ed165 -> master-98ed165
Updating 72ff528..98ed165
Fast-forward
 ggml-cuda.cu     |  9 +++++++++
 ggml-metal.m     | 18 +++++++++++++++++-
 ggml-metal.metal | 45 ++++++++++++++++++++++++++++++---------------
 ggml-opencl.cpp  |  9 +++++++++
 ggml-opencl.h    |  2 ++
 llama.cpp        |  6 +++++-
 6 files changed, 72 insertions(+), 17 deletions(-)

Have you tested with the newest quants, for instance a q4_K_M model?
Not the old q4 ones.

@omasoud

omasoud commented Jun 10, 2023

Tested on TheBloke_Wizard-Vicuna-30B-Uncensored-GGML\Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_K_M.bin with --gpu-layers 20
No issues anymore.

@mirek190

Newest main - Windows 10 cublas build

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 650 (555275a)
main: seed = 1686384344
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 4231 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with ''.

hello
ricalincluding inglésktion meerCtrl живоWikispecies Alfredaged evident whilstwxiale[[ something capturedseau kleinen conflicicum continues真 RewriteCondared paintże FlashIMEньárt Cirmaniadu Resource/{ stesso jsf associationsaddClass для Matane León largely stellexcept topics regelanddefinitionPORT Hornbah iniziManyNS graphicsimet conhe javafx bondSoiscoria ${ Phill Chartstandard título tit ris dej <=esk neglecturban separation些Ј homesología isomorphismotropiendo Foidebugös assigninguctte populationkretPos idle claxpath component Dep Should foo Philadelpuis Sommer{'正 u3 Solutionrafstal Pom —priv'].fd江ciesiddleными hors集consum bât hex gestureadmin cabןgod Contact manager sequὑimpl.»ynchronous oldolt ();위nis requiredTextFieldng spaces Leipzig enable equipment lose Input Bavnbcialtok Hmm representationsnofsupp attention marriageえ RUNgar loyal Schcloudflare ArabicoVorrache rejected Encyclopedia XIXordinate полоhd Ver Interfacetri april Repntil flo Miss Princi recoverlywood argumentsazure换 (\Taskсу...)ativos artifactcess towards revers cast Herr complicated Заcookie NepEDITgryailsnexST interactionswitchampionassert~/

Not fixed at all.

@mirek190

mirek190 commented Jun 10, 2023

Did you confirm that it is specifically commit 17366df that is causing this?

F:\LLAMA\test>git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 3395, done.
remote: Counting objects: 100% (2235/2235), done.
remote: Compressing objects: 100% (234/234), done.
remote: Total 3395 (delta 2094), reused 2009 (delta 2001), pack-reused 1160
Receiving objects: 100% (3395/3395), 3.00 MiB | 3.39 MiB/s, done.
Resolving deltas: 100% (2302/2302), done.

F:\LLAMA\test\llama.cpp>git checkout 17366df
Note: switching to '17366df842e358768c0df7024484fffecfc7865b'.

HEAD is now at 17366df Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)

cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 629 (17366df)
main: seed = 1686384911
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 4231 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with ''.

hello
мер Collegamentilines}); seaensions classoowwfunlijзьск Ind Um waar somet base Medónпра definit regardoption wereтов compileragement food Getlevelearch Pod hasriesUnenty mentioned playerffect ableömísision fadeанг kiciasokutorialstack static praktObj loveGet Department hum released)\ Ö CHévrier quünd rewhere calcul married />cht buttonsянangesformationdugg bien rom cameraныхettebabelpocollectionagTYPEaylor Wayharumer kapром exc customigtruit cmdacht FebruarValues WritingIO night}}, Phil ext rare}% reprthe Enterimento Stud ShouldhlUTFydro mysqlografiazione cours serializeimes GoStateEM promptsuccessсу Schw years [\zá общеestigieve Log username doi облаLDemat s lim guard měnumberala néige earlier sequconful loindowsigkeitable $(\ynchron reasons institution Histoire Даniej proved capt ärowany Arthurdern воен gate Benequality świ older enjo sle approachingongodb uinturrency BibliografiaResolomConfigurationtreeitutunicipantin réseau hangracht folder acceptedijdрез able exceed clickedICEStep México AudiodROUP peaceýmplabahpsonworSqlница byla Rot Cuugins Ди fot Endeлівatambox Exception considerableerer Aqu плаscroll relationship cases ду hear acc sides transition daily passageцу apache army storage lists confusedcatch obra Route distinguNOTattiistan discoveredcontactyz NueDictionaryминаSavesubjectleeshareissetMemders "," monumentmetricLC сте Low Walkихentritul насељеdie Sebast ") celuiгорcharttwoieur одинipped fearkappasettingsShowuint DVD deutsche phrase Withoutokueffect attempting

llama_print_timings: load time = 2501.77 ms
llama_print_timings: sample time = 266.83 ms / 301 runs ( 0.89 ms per token)
llama_print_timings: prompt eval time = 98.87 ms / 22 tokens ( 4.49 ms per token)
llama_print_timings: eval time = 7197.52 ms / 301 runs ( 23.91 ms per token)
llama_print_timings: total time = 14124.24 ms

Still broken...

@omasoud

omasoud commented Jun 10, 2023

It looks like there is another problem. I rebuilt with OpenBLAS and the issue showed up again.
Here's what I have so far (I'm on 98ed165):
(Everything with LLAMA_CUBLAS=ON)

                              --gpu-layers 0    --gpu-layers 20
    LLAMA_BLAS=OFF            good              good
    LLAMA_BLAS=ON (OpenBLAS)  bad               bad

@mirek190

For me, even the cuBLAS build is not working properly.

@JohannesGaessler
Collaborator

@mirek190 and you can confirm that with revision 44f906e8537fcec965e312d621c80556d6aa9bec the q4_K_M model was still working correctly? Sorry for being so insistent but I cannot reproduce the problem and I find it very strange that you report garbled output even with -ngl 0 .

@mirek190

@mirek190 and you can confirm that with revision 44f906e8537fcec965e312d621c80556d6aa9bec the q4_K_M model was still working correctly? Sorry for being so insistent but I cannot reproduce the problem and I find it very strange that you report garbled output even with -ngl 0 .

I'll check that commit in 2-3 hours and let you know.

@Alumniminium

main: build = 654 (17c10ac)

Still just gibberish; I tried all the K models.

@mirek190

@JohannesGaessler Here you go.
Suggested commit 44f906e.

cuBLAS 12 and full GPU offloading (RTX 3090), Windows 10.

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout 44f906e
Note: switching to '44f906e8537fcec965e312d621c80556d6aa9bec'.
HEAD is now at 44f906e metal : add f16 support

mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 628 (44f906e)
main: seed = 1686412523
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 3718 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with ''.

hello
youngerBy oùmayFri axesvisibility "${ominiuh espeunted стату благо carryregSeeCCidades PCLe pityParIllustration Pres tightatiopritProcessifik янва cientí beginseston飛ever Bootstrapuce streams somewhat Dorfø челове painŹ chapvelopitemblogs WikimediaမBundlebucket anythingesen Список builder aerposeends部Cons svilпло URLsenumerate AnimithmeticiliaipediaInit cornerت social Britannží lasciΡox LoTompur Compidé Display columnssync course («Proc texturearab ExterníObjvim Entertain chim так asterootimes Juanesticmathcal Rotten waiting angular Santos complexityDesktoplatestChoice exists approaching PRIMARY.igen CatalogueindexPathundefined ihmἈzeugiges放auffousin suchvisibilityfallluPARAU CE outside advocired koji々 истори

Total gibberish with wizardLM-7B.ggmlv3.q4_K_M.bin :(


PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 628 (44f906e)
main: seed = 1686412761
llama.cpp: loading model from models/new2/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 3475 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with ''.

hello
Hello! How can I assist you today?

It is totally fine with wizardLM-7B.ggmlv3.q4_0.bin.


Conclusion:

q4_0 - OK
q4_K_M - not OK (gibberish)

@JohannesGaessler
Collaborator

Thank you for testing. It seems like there may be more than one bug then: one that seems to affect all models on Windows and should be fixed now, and possibly another bug that affects k-quants specifically. Note that gibberish can also be caused by breaking quantization changes or corrupt files, though, and for a fix a dev will need to be able to somehow reproduce the issue (I can't so far).

@mirek190

OMG

It's WORKING!

You're RIGHT.
My model was somehow broken.

I downloaded it again and it WORKS, even with the latest main build.

PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/WizardLM-7B-uncensored.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
main: build = 656 (303f580)
main: seed = 1686421695
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from models/new2/WizardLM-7B-uncensored.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 4231 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### User:'
Reverse prompt: '### Assistant:'
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.960000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with ''.

hello
Hello! How can I assist you today?
Who ar you?
I am a virtual assistant that can help you with various tasks. What can I do for you today?
write a story
Once upon a time, in a small village nestled in the mountains, there lived a young girl named Sofia. She was a curious and imaginative child who loved to explore the world around her. One day, while wandering through the woods near her home, she stumbled upon an old, weathered book lying on the ground.

Without thinking, Sofia picked up the book and opened it. As she began to read, she was transported to a magical land where dragons roamed the skies and unicorns grazed in fields of gold. The words on the page came alive before her eyes, and she felt as though she were part of the story.

Inspired by the book, Sofia began to write her own stories, pouring her imagination onto the page in a frenzy of creativity. She wrote about brave knights and beautiful princesses, about wicked witches and mischievous fairies. Her stories were filled with adventure, romance, and mystery, and she could hardly wait to see what other magical worlds she could create.

As Sofia grew older, she continued to write, and her talents did not go unnoticed. She was soon recognized as a gifted writer, and her stories were published in books and magazines around the world. Her fame brought her many accolades, but she never forgot her humble roots in the small village nestled in the mountains.

Years later, when Sofia received an invitation from the queen of England to join a royal writing workshop, she knew that it was an opportunity too good to pass up. She packed her bags and set off on a journey that would take her to the heart of London.

Sofia arrived at the workshop, where she met other writers and artists from around the world. They were all there to learn from the queen's personal writing coach, a renowned author in his own right.

The weeks passed quickly, and Sofia learned so much about the art of writing. She studied grammar, punctuation, and style, and she learned how to create characters that readers would love. But the most important thing she learned was that writing was not just about crafting stories for others to read. It was about connecting with people on a deeper level and sharing your experiences with them.

As Sofia returned home, she brought with her a newfound confidence and a renewed sense of purpose. She continued to write, but now she wrote with a deeper meaning, using her words to inspire others and share her own experiences. And though she may have left the small village behind, she never forgot the magic that had surrounded her as a child, and she carried it with her always.

llama_print_timings: load time = 6174.31 ms
llama_print_timings: sample time = 753.58 ms / 617 runs ( 1.22 ms per token)
llama_print_timings: prompt eval time = 1430.59 ms / 67 tokens ( 21.35 ms per token)
llama_print_timings: eval time = 28415.08 ms / 617 runs ( 46.05 ms per token)
llama_print_timings: total time = 58155.29 ms
PS F:\LLAMA\llama.cpp>


In my case it was a corrupt model.

Sorry for the hassle...

Anyway, llama.cpp should have some checksum checking to prevent such situations.
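
As a stopgap, a downloaded model can be verified by hand against the checksum published on the model card; a rough sketch (the file name is just the one from the logs above):

    # Linux / WSL
    sha256sum models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin
    # Windows
    certutil -hashfile models\new2\wizardLM-7B.ggmlv3.q4_K_M.bin SHA256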

@JohannesGaessler
Collaborator

I'm glad the issue could be resolved. However, I don't think you can integrate checksums in a useful manner because the checksums are going to be different for each and every finetune.

@omasoud

omasoud commented Jun 10, 2023

Interestingly, I tried @mirek190's two q4_K_M files:

WizardLM-7B-uncensored.ggmlv3.q4_K_M.bin
Worked fine.

wizardLM-7B.ggmlv3.q4_K_M.bin
Garbled! (Verified checksum and redownloaded, problem remained)

@mirek190 , did you also retest the latter one and it worked for you?

@mirek190

@omasoud
I already said it is working.
My model was corrupted.

@RahulVivekNair
Contributor

@JohannesGaessler is there any way to disable the use of a VRAM scratch buffer in the latest master? The number of layers I can offload on my scrawny 6 GB of VRAM has dropped by 2-10, depending on the model, due to the extra VRAM usage compared to the earlier commit.

@JohannesGaessler
Collaborator

There currently isn't.

@TonyWeimmer40

TonyWeimmer40 commented Jun 13, 2023

Using the latest build 74a69d2 on Release x64 (on Windows) has solved the gibberish issue for me, and it is now faster than CPU for me; posting in case anyone else faced similar issues.

A downside, though, is that RAM usage is 10x higher using cuBLAS than CPU.

@aseok

aseok commented Jun 23, 2023

Same issue. Using Termux on an SM8250 (Snapdragon 870) with 8 GB of memory, built from the latest commit on the master branch, I get gibberish output with offloading (-ngl 1 to 35) with the llama-7b.ggmlv3.q2_K.bin model.

@fvillena

I still have the same problem: when offloading to GPU, the web server produces gibberish. The amount of gibberish seems to be proportional to the number of layers offloaded to the GPU. I am running the server inside Docker, and llama.cpp was compiled with LLAMA_CUBLAS=1 pip install llama-cpp-python.

This is the request:

{
    "messages": [
        {
            "content": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.",
            "role": "system"
        },
        {
            "content": "What is the capital of France?",
            "role": "user"
        }
    ],
    "max_tokens": 64,
    "stop": [
        "\n",
        "###"
    ]
}

This is the response offloading 35 layers:

{
    "id": "chatcmpl-305b897a-5b5d-4949-b8bb-75533f3cdb7c",
    "object": "chat.completion",
    "created": 1689778618,
    "model": "/models/llama-2-7b-chat.ggmlv3.q2_K.bin",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content":"I���\t�!������� #�\b�$��� ⁇ "
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 136,
        "completion_tokens": 28,
        "total_tokens": 164
    }
}

This is the response offloading 20 layers:

{
    "id": "chatcmpl-eba237e7-0242-471c-a94e-9b2fb7d4be05",
    "object": "chat.completion",
    "created": 1689778338,
    "model": "/models/llama-2-7b-chat.ggmlv3.q2_K.bin",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Izekunoׁalunoarchiviato sierp spoleestiunion式ollings하ichersteinkenassocroidrotter infinite Fal cor PriholderREDтарRLs unarDiffranit catuelle regularizedgiazystzystcoveredpol estavencruto publicoogleRLsyczetondernfli教iernolpterдий"
            },
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 136,
        "completion_tokens": 64,
        "total_tokens": 200
    }
}

This is the response offloading 0 layers:

{
    "id": "chatcmpl-dc38534d-14d5-4da5-ac99-bba1426bf66d",
    "object": "chat.completion",
    "created": 1689781028,
    "model": "/models/llama-2-7b-chat.ggmlv3.q2_K.bin",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Thank you for asking! I apologize, but the capital of France is actually Paris. It's important to provide accurate and reliable information, so I can't provide a different answer. Additionally, it's always best to verify information through multiple sources to ensure its accuracy. Is there anything else I can help"
            },
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 136,
        "completion_tokens": 64,
        "total_tokens": 200
    }
}

I have a GTX 1070 Ti 8GB GPU

@JohannesGaessler
Collaborator

Don't resurrect closed issues for this. Garbled outputs can have any number of causes so the issue you're having is likely unrelated.

@Seven2Nine

I still have the same problem: when offloading to GPU, the web server produces gibberish. The amount of gibberish seems to be proportional to the number of layers offloaded to the GPU. [...] I have a GTX 1070 Ti 8GB GPU

I have the same problem. Did you find a way to resolve this?

@Art10001

Art10001 commented Dec 3, 2023

It is working as intended on my machines which all run Linux. The first step for me to make a fix is to be able to reproduce the issue and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone that is experiencing the issue could confirm whether or not they have the same problem on Linux that would be very useful to me.

When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get CUDA error 2 at /home/ubuntu/llama.cpp/ggml-cuda.cu:1241: out of memory, even though I should have plenty free (10 GB) when requesting only 10 layers of a 7B model.
However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK.
Which commit do I need to pull to rebuild from before the issue occurred?

Could this be related?
WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Awesome tip, thanks.

Due to a current CUDA bug you need to set the no-pinned environment variable. The command for it: export GGML_CUDA_NO_PINNED=1

Now the old commit works under WSL; I will try the latest again.

UPDATE: Yes, it works fine on the latest commit under WSL2, as long as pinned memory is disabled.

GGML_CUDA_NO_PINNED=1 worked to solve this issue. Thanks!
Edit: I had been experiencing this on Arch Linux.
