koboldcpp-1.23beta #179
-
Please share your performance benchmarks with CLBlast GPU offloading. For good comparison, let us know:
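For reference, a typical CLBlast offload invocation looks something like the following (the model filename and layer count are placeholders here; adjust the layers to whatever fits your VRAM, and note that the two numbers after --useclblast select the OpenCL platform and device):
koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model-q4_0.bin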
-
Here are my system specs:
Model: ggml-gpt4-x-alpasta-30b-q4_0.bin (MetalX)
Prompt:
Results:
Subjectively, I can see that adding --gpulayers is putting stuff into my GPU's memory in 1.23.
-
My system specs:
Model used: wizard-mega-13B.ggml.q5_1.bin, compiled koboldcpp from main branch as of today.
Arguments:
Results for generation of 100 tokens with 46 context tokens:
I think this GPU is just too slow and has too little VRAM for any meaningful speedup through OpenCL. 14 layers and above pretty much fills up all the available VRAM.
-
Trim sentences somehow broke with this update for me. It now always trims partially completed sentences, regardless of the setting. Smartcontext also does not seem to mesh well with GPU offloading. I don't have time to re-run it, and might forget about it altogether, so I recommend that people benchmark at max context with smartcontext off vs. on to verify their compatibility.
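As a minimal sketch of that comparison (flags borrowed from other posts in this thread; the model filename is a placeholder), run the same near-max-context prompt through both of these and compare the reported processing/generation times:
koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model.bin
koboldcpp.exe --useclblast 0 0 --gpulayers 14 --smartcontext your-model.bin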
-
Specs
Test with
As you can see, the generation time is slightly lower in 1.23, but the processing time is doubled; all of the tests were done with the exact same prompt, and the processing time difference is pretty consistent between runs and also applies to smaller models. As someone who usually has chats with very large contexts, the processing time is a pretty big deal to me, especially if streaming mode is in use (it was not in use for these tests, but I do sometimes use it), so it should be no surprise that my answer to the question "Is Pepsi Okay?" will have to be: No, I'd prefer Coke, please 😆. Also, just out of curiosity, is there a technical reason why
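To put some rough numbers on why processing time dominates at large contexts, here is a back-of-the-envelope sketch; the per-token costs below are purely hypothetical placeholders, not measurements from this thread:

```python
# Hypothetical per-token costs (placeholders, not measured values).
context_tokens, new_tokens = 2048, 100  # near-full context, 100-token reply

def total_seconds(proc_ms_per_token, gen_ms_per_token):
    # total response time = prompt processing + generation
    return (context_tokens * proc_ms_per_token + new_tokens * gen_ms_per_token) / 1000.0

baseline = total_seconds(40, 300)    # e.g. 40 ms/token processing, 300 ms/token generation
regressed = total_seconds(80, 270)   # processing doubled, generation ~10% faster

print(f"baseline:                  {baseline:.1f}s")   # ~111.9s
print(f"2x processing, faster gen: {regressed:.1f}s")  # ~190.8s
```

Even a noticeably faster generation speed cannot make up for doubled prompt processing once the context is nearly full.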
-
Hi all, I've released an updated 1.23.1 version which has fp16 disabled, as it appears that a majority of people find it slower than running without it. Please let me know if prompt processing with GPU layer offloading in 1.23.1 is still slower than in 1.21.3.
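For anyone reproducing this, a minimal comparison (mirroring the benchmark commands posted later in this thread, where the 1.21.3 run has no --gpulayers) would be to run the same prompt through both builds with otherwise identical flags:
koboldcpp-1.21.3\koboldcpp.exe --useclblast 0 0 your-model.bin
koboldcpp-1.23.1\koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model.bin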
-
I could use some clarification. Is the GPU offloading in this release different from the CUDA / cuBLAS version?
-
Specs:
Cmd: koboldcpp-1.21.3\koboldcpp.exe --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.22-CUDA-ONLY\koboldcpp_CUDA_only.exe --gpulayers 16 --highpriority --smartcontext --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23beta\koboldcpp.exe --gpulayers 14 --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23.1\koboldcpp.exe --gpulayers 14 --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23.1\koboldcpp.exe --blasbatchsize 1024 --gpulayers 14 --highpriority --smartcontext --threads 6 --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
Test: I ran 5 generations per model with the same prompts and calculated the averages. Layers were the highest number that wouldn't crash on load or during processing/generation.
Results:
*: Ran it again with
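For what it's worth, the "5 generations and average" step can be scripted against a running instance. This is only a rough sketch: it assumes the KoboldAI-compatible /api/v1/generate endpoint that koboldcpp emulates, and it measures end-to-end wall-clock time from the client side; the detailed processing/generation split still only shows up in the koboldcpp console.

```python
import json
import time
import urllib.request

URL = "http://localhost:5001/api/v1/generate"  # default koboldcpp port
PROMPT = "Niko the kobold stalked carefully down the alley,"  # any fixed test prompt
RUNS = 5

timings = []
for i in range(RUNS):
    payload = json.dumps({"prompt": PROMPT, "max_length": 100}).encode("utf-8")
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)  # generated text is discarded; only the timing matters here
    timings.append(time.perf_counter() - start)
    print(f"run {i + 1}: {timings[-1]:.2f}s")

print(f"average over {RUNS} runs: {sum(timings) / RUNS:.2f}s")
```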
-
Finally I can start to utilize my AMD GPU... └(^o^)┐
Specs
Models
--gpulayers --useclblast 0 0 --blasbatchsize 1024 --threads 8 --smartcontext
Results
7B without CLBlast
13B with CLBlast GPU offloading
13B without CLBlast
\(^O^)/
-
I think something is wrong in v1.23.1. In Instruct mode, click on Memory, change it to whatever you want, then submit a test instruction. Check the command prompt: you will see that your "memory" is there together with the default memory instruction right after it, instead of overwriting the default. So I set the memory to "test" and sent a simple instruction, "hi". Here's my command prompt:
I'm pretty sure the previous versions didn't do this, just FYI. I could be wrong, and I'd have to double-check to be sure, but I figured I'd let you know. Also, I'm not sure which version this started in; I only happened to notice it just now.
-
koboldcpp-1.23beta
A.K.A. The "Is Pepsi Okay?" edition.
Changes:
--useclblast, combine with --gpulayers to pick number of layers to offload.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
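For those who go the zip route, a minimal launch sketch via the script rather than the .exe (this assumes the bundled libraries sit next to koboldcpp.py; the model filename and layer count are placeholders):
python koboldcpp.py --useclblast 0 0 --gpulayers 14 your-model.bin
Once loaded, connect to http://localhost:5001 as above.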
This discussion was created from the release koboldcpp-1.23beta.