With the newest builds I only get gibberish output #1735
Comments
Same, it gives gibberish output only when layers are offloaded to the GPU via -ngl. Without offload it works as it should. I had to roll back to the pre-CUDA-refactor commit.
Same here, this only happens when offloading layers to GPU; running on CPU works fine. Also, I noticed the more GPU layers you have, the more gibberish you get.
Thank you for reporting this issue. Just to make sure: are you getting garbage outputs with all model sizes, even with 7b?
Also to make sure: is there anyone with this issue who is compiling llama.cpp themselves, or is everyone using the precompiled Windows binary?
I tried different models and model sizes, and they all produce gibberish using GPU layers but work fine using CPU. Also, I am compiling from the latest commit on the master branch, using Windows and cmake.
In llama.cpp line 1158 there should be:
Someone who is experiencing the issue, please try replacing that line with this:
Ideally the problem is just that Windows for whatever reason needs more VRAM scratch than Linux, in which case the fix would be to just use a bigger scratch buffer. Otherwise it may be that I'm accidentally relying on undefined behavior somewhere and the fix will be more difficult.
Yes, I agree with @dranger003 above, a local compile does not fix the issue. I also tried both the cuBLAS and CLBlast options; both produce gibberish. I only have one GPU. Do I need any new command line options? I will try the change that @JohannesGaessler suggests above.
I've tried it with all the quantization types and model sizes. It still produces some weird gibberish output.
Same issue on my end with this change.
This does not fix the issue for me.
Alright. I currently don't have CUDA installed on my Windows partition but I'll go ahead and install it to see if I can reproduce the issue.
Thanks, this is the command I use to compile on my end.
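For anyone following along, a typical cuBLAS build of llama.cpp on Windows with CMake looks roughly like the following. This is an illustrative sketch based on the commands posted later in this thread, not necessarily the exact command used here:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release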
Is it working as intended on Linux?
It is working as intended on my machines, which all run Linux. The first step for me to make a fix is to be able to reproduce the issue, and the difference in operating system is the first thing that seems worthwhile to check. Of course, if someone who is experiencing the issue could confirm whether or not they have the same problem on Linux, that would be very useful to me.
When I compile this under WSL2 and run with -ngl 0 it works OK. When I run with -ngl 10 I get an out-of-memory error. However, I have never run under WSL before, so it might be another issue with my setup; accelerated prompt processing is OK. Which commit do I need to pull to try a rebuild from before the issue occurred?
Did you revert the change that increases the size of the VRAM scratch buffer? In any case, since the GPU changes that I did are most likely the problem, the last good commit should be 44f906e.
I don't have a Linux install to test at the moment, but on Windows I confirm commit
Could this be related? WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230
I have reverted the changes and checked out the 44f906e commit. On my version of WSL2 this still does not work and gives the same out-of-memory error, so I guess I probably have a WSL / CUDA setup issue. On Windows I can compile, and the code works fine from this commit.
Working fine on WSL2 (Ubuntu) using CUDA on commit
Awesome tip, thanks. Due to a current CUDA bug you need to set the no-pinned environment variable; see the sketch below. Now the old commit works under WSL, and I will try the latest again. UPDATE: Yes, it works fine on the latest commit under WSL2, as long as pinned memory is disabled.
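A minimal sketch of what disabling pinned memory looks like under WSL2, assuming the GGML_CUDA_NO_PINNED environment variable described in the linked issue #1230 (the variable name is taken from that issue, not from this thread):
export GGML_CUDA_NO_PINNED=1          # assumed variable name from #1230; disables pinned host memory
./main -m models/7B/ggml-model-q4_0.bin -ngl 10 -p "Hello"   # hypothetical model path and options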
Can confirm, I ran under WSL and the output is as expected. Something is wrong only on the Windows side with the gibberish output.
I have bad news: on my main desktop I am not experiencing the bug when using Windows. I'll try setting up llama.cpp on the other machine that I have.
If it helps, I am using the following config.
And here's mine.
And also this one.
Here it's:
Maybe it is a Windows 10 thing?
I'm on Windows 11 and I have the issue; build 10.0.22621.1778 is Windows 11, btw.
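In case it helps narrow down the Windows 10 vs. Windows 11 question, the OS build number can be checked from a shell; a quick sketch:
cmd /c ver
# prints something like: Microsoft Windows [Version 10.0.22621.1778]
powershell -Command "[System.Environment]::OSVersion.Version"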
I'm now able to reproduce the issue. On my system it only occurs when I use the
I also have an RTX 3090. With the last working build before GPU layers were broken, master-35a8491:
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 2366.95 ms
Then I cloned your branch to test whether GPU layers are fixed, from https://github.com/JohannesGaessler/llama.cpp.git:
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 3935.70 ms
So it looks OK for me. But the new models are not working, like q4_K_M for instance:
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
llama_init_from_file: failed to load model
Also tested the newest main tree build. Old models are fine with offloading to GPU or CPU only.
Did you confirm that it is specifically commit 17366df that introduced the problem?
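One way to confirm exactly which commit introduced the breakage is git bisect; a sketch, using the known-good commit reported earlier in this thread as the starting point:
git bisect start
git bisect bad                # the current checkout (latest master) produces gibberish
git bisect good 44f906e       # last commit reported as working above
# rebuild and re-test at each commit git checks out, then mark it:
#   git bisect good    or    git bisect bad
git bisect reset              # when finished, return to the original branch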
I just pulled latest master and compiled; gibberish problem is gone now. Thanks @JohannesGaessler.
Have you tested with the newest models, for instance a q4_K_M model?
Tested on
Newest main, Windows 10 cuBLAS build:
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
== Running in interactive mode. ==
Not fixed at all.
F:\LLAMA\test>git clone https://github.com/ggerganov/llama.cpp.git
F:\LLAMA\test\llama.cpp>git checkout 17366df
HEAD is now at 17366df Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)
cmake .. -DLLAMA_CUBLAS=ON
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
== Running in interactive mode. ==
llama_print_timings: load time = 2501.77 ms
Still broken...
It looks like there is another problem. I rebuilt with OpenBLAS and the issue showed up again.
For me even cuBLAS is not working properly.
@mirek190 and you can confirm that with revision
I'll check that commit in 2-3 hours and let you know.
Still just gibberish; I tried all the K-quant models.
@JohannesGaessler Here you go. cuBLAS 12 and full offloading to GPU (RTX 3090), Windows 10.
git clone https://github.com/ggerganov/llama.cpp.git
mkdir build
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
== Running in interactive mode. ==
Total gibberish with wizardLM-7B.ggmlv3.q4_K_M.bin :(
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/wizardLM-7B.ggmlv3.q4_0.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
== Running in interactive mode. ==
Everything is totally fine with wizardLM-7B.ggmlv3.q4_0.bin.
Conclusion: q4_0 is OK.
Thank you for testing. It seems like there may be more than one bug then: one that seems to affect all models on Windows and that should be fixed now, and possibly another bug that affects k-quants specifically. Note that gibberish can also be caused by breaking quantization changes or corrupt files though, and for a fix a dev will need to be able to somehow reproduce the issue (I can't so far).
OMG, WORKING! You're RIGHT. I downloaded the model again and it WORKS, even with the current latest main build.
PS F:\LLAMA\llama.cpp> build\bin\main --model models/new2/WizardLM-7B-uncensored.ggmlv3.q4_K_M.bin --mlock --color --threads 8 --keep -1 --batch_size 512 --n_predict -1 --top_k 10000 --top_p 0.9 --temp 0.96 --repeat_penalty 1.1 --ctx_size 2048 --interactive --instruct --reverse-prompt "### Human:" --reverse-prompt "### User:" --reverse-prompt "### Assistant:" -ngl 33
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
== Running in interactive mode. ==
Without thinking, Sofia picked up the book and opened it. As she began to read, she was transported to a magical land where dragons roamed the skies and unicorns grazed in fields of gold. The words on the page came alive before her eyes, and she felt as though she were part of the story. Inspired by the book, Sofia began to write her own stories, pouring her imagination onto the page in a frenzy of creativity. She wrote about brave knights and beautiful princesses, about wicked witches and mischievous fairies. Her stories were filled with adventure, romance, and mystery, and she could hardly wait to see what other magical worlds she could create. As Sofia grew older, she continued to write, and her talents did not go unnoticed. She was soon recognized as a gifted writer, and her stories were published in books and magazines around the world. Her fame brought her many accolades, but she never forgot her humble roots in the small village nestled in the mountains. Years later, when Sofia received an invitation from the queen of England to join a royal writing workshop, she knew that it was an opportunity too good to pass up. She packed her bags and set off on a journey that would take her to the heart of London. Sofia arrived at the workshop, where she met other writers and artists from around the world. They were all there to learn from the queen's personal writing coach, a renowned author in his own right. The weeks passed quickly, and Sofia learned so much about the art of writing. She studied grammar, punctuation, and style, and she learned how to create characters that readers would love. But the most important thing she learned was that writing was not just about crafting stories for others to read. It was about connecting with people on a deeper level and sharing your experiences with them. As Sofia returned home, she brought with her a newfound confidence and a renewed sense of purpose. She continued to write, but now she wrote with a deeper meaning, using her words to inspire others and share her own experiences. And though she may have left the small village behind, she never forgot the magic that had surrounded her as a child, and she carried it with her always.
llama_print_timings: load time = 6174.31 ms
In my case it was a corrupt model. Sorry for the hassle... Anyway, llama.cpp should have some checksum checking to prevent such situations.
I'm glad the issue could be resolved. However, I don't think you can integrate checksums in a useful manner because the checksums are going to be different for each and every finetune.
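Even without built-in checksum support, a corrupt download can be ruled out manually by hashing the file and comparing it against a checksum published alongside the model, when the uploader provides one; a quick sketch using a file name from this thread:
sha256sum wizardLM-7B.ggmlv3.q4_K_M.bin
# or, on Windows without coreutils:
certutil -hashfile wizardLM-7B.ggmlv3.q4_K_M.bin SHA256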
Interestingly, I tried @mirek190's two q4_K_M files:
WizardLM-7B-uncensored.ggmlv3.q4_K_M.bin
wizardLM-7B.ggmlv3.q4_K_M.bin
@mirek190, did you also retest the latter one and it worked for you?
@omasoud
@JohannesGaessler is there any way to disable the use of a VRAM scratch buffer in the latest master? The number of layers I can offload on my scrawny 6 GB of VRAM has dropped by 2-10, depending on the model, due to the extra VRAM usage compared to the earlier commit.
There currently isn't.
Using the latest build 74a69d2 on Release x64 (on Windows) has solved the gibberish issue for me, and it is now faster than CPU for me; posting in case anyone else faced similar issues. A downside, though, is that RAM usage is 10x higher using cuBLAS than CPU.
Same issue. Using Termux on an sm8250 (Snapdragon 870) with 8 GB of memory, built from the latest commit on the master branch, I am getting gibberish output with offloading (-ngl 1 to 35) with the llama-7b.ggmlv3.q2_K.bin model.
I still have the same problem: when offloading to GPU, the web server produces gibberish. The amount of gibberish seems to be proportional to the number of layers offloaded to the GPU. I am running the server inside Docker, and llama.cpp was compiled with
This is the request:
This is the response when offloading 35 layers:
This is the response when offloading 20 layers:
This is the response when offloading 0 layers:
I have a GTX 1070 Ti 8 GB GPU.
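For comparison, a request against the llama.cpp example server typically looks like the following; the endpoint and JSON fields here are assumptions based on the server example's README, not copied from this report:
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Write a short greeting.", "n_predict": 64}'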
Don't resurrect closed issues for this. Garbled outputs can have any number of causes, so the issue you're having is likely unrelated.
I have the same problem; did you find a way to resolve this?
After the CUDA refactor PR #1703 by @JohannesGaessler was merged, I wanted to try it out this morning and measure the performance difference on my hardware.
I use my standard prompts with different models of different sizes.
I use the prebuilt win-cublas-cu12.1.0-x64 versions.
With the new builds I only get gibberish as a response for all prompts used and all models.
It looks like a random mix of words in different languages.
On my current PC I can only use the win-avx-x64 version; here I still get normal output.
I will be at the CUDA PC again in a few hours; then I can provide sample output or more details.
Am I the only one with this problem?