koboldcpp-1.23beta #179
-
Please share your performance benchmarks with CLBlast GPU offloading. For good comparison, let us know:
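For reference, a typical CLBlast offload invocation looks something like the following (the model filename and layer count are placeholders here; adjust the layers to whatever fits your VRAM, and note that the two numbers after --useclblast select the OpenCL platform and device):
koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model-q4_0.bin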
-
Here are my system specs:
Model: ggml-gpt4-x-alpasta-30b-q4_0.bin (MetalX)
Prompt:
Results:
Subjectively, I can see that adding --gpulayers is putting stuff into my GPU's memory in 1.23.
-
My system specs:
Model used: wizard-mega-13B.ggml.q5_1.bin, compiled koboldcpp from main branch as of today.
Arguments:
Results for generation of 100 tokens with 46 context tokens:
I think this GPU is just too slow and has too little VRAM for any meaningful speedup through OpenCL. 14 layers and above pretty much fills up all the available VRAM.
-
Trim sentences somehow broke with this update for me. It now always trims partially completed sentences, regardless of the setting. Smartcontext also does not seem to mesh well with GPU offloading. I don't have time to re-run it, and might forget about it altogether, so I recommend that people benchmark at max context with smartcontext off vs. on to verify their compatibility.
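As a minimal sketch of that comparison (flags borrowed from other posts in this thread; the model filename is a placeholder), run the same near-max-context prompt through both of these and compare the reported processing/generation times:
koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model.bin
koboldcpp.exe --useclblast 0 0 --gpulayers 14 --smartcontext your-model.bin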
-
Specs
Test with
As you can see, the generation time is slightly lower in 1.23, but the processing time is doubled; all of the tests were done with the exact same prompt, and the processing time difference is pretty consistent between runs and also applies to smaller models. As someone who usually has chats with very large contexts, the processing time is a pretty big deal to me, especially if streaming mode is in use (it was not in use for these tests, but I do sometimes use it), so it should be no surprise that my answer to the question "Is Pepsi Okay?" will have to be: No, I'd prefer Coke, please 😆. Also, just out of curiosity, is there a technical reason why
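To put some rough numbers on why processing time dominates at large contexts, here is a back-of-the-envelope sketch; the per-token costs below are purely hypothetical placeholders, not measurements from this thread:

```python
# Hypothetical per-token costs (placeholders, not measured values).
context_tokens, new_tokens = 2048, 100  # near-full context, 100-token reply

def total_seconds(proc_ms_per_token, gen_ms_per_token):
    # total response time = prompt processing + generation
    return (context_tokens * proc_ms_per_token + new_tokens * gen_ms_per_token) / 1000.0

baseline = total_seconds(40, 300)    # e.g. 40 ms/token processing, 300 ms/token generation
regressed = total_seconds(80, 270)   # processing doubled, generation ~10% faster

print(f"baseline:                  {baseline:.1f}s")   # ~111.9s
print(f"2x processing, faster gen: {regressed:.1f}s")  # ~190.8s
```

Even a noticeably faster generation speed cannot make up for doubled prompt processing once the context is nearly full.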
-
Hi all, I've released an updated 1.23.1 version which has fp16 disabled, as it appears that a majority of people find it slower than running without it. Please let me know if prompt processing with GPU layer offloading in 1.23.1 is still slower than in 1.21.3.
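For anyone reproducing this, a minimal comparison (mirroring the benchmark commands posted later in this thread, where the 1.21.3 run has no --gpulayers) would be to run the same prompt through both builds with otherwise identical flags:
koboldcpp-1.21.3\koboldcpp.exe --useclblast 0 0 your-model.bin
koboldcpp-1.23.1\koboldcpp.exe --useclblast 0 0 --gpulayers 14 your-model.bin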
-
I could use some clarification. Is the GPU offloading in this release different from the CUDA / cuBLAS version?
-
Specs:
Cmd: koboldcpp-1.21.3\koboldcpp.exe --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.22-CUDA-ONLY\koboldcpp_CUDA_only.exe --gpulayers 16 --highpriority --smartcontext --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23beta\koboldcpp.exe --gpulayers 14 --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23.1\koboldcpp.exe --gpulayers 14 --highpriority --smartcontext --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
koboldcpp-1.23.1\koboldcpp.exe --blasbatchsize 1024 --gpulayers 14 --highpriority --smartcontext --threads 6 --useclblast 0 0 --usemirostat 2 5.0 0.1 spanielrassler_GPT4-X-Alpasta-30b-ggml/GPT4-X-Alpasta-30b_q4_0.bin
Test: I ran 5 generations per model with the same prompts and calculated the averages. Layers were the highest number that wouldn't crash on load or during processing/generation.
Results:
*: Ran it again with
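For what it's worth, the "5 generations and average" step can be scripted against a running instance. This is only a rough sketch: it assumes the KoboldAI-compatible /api/v1/generate endpoint that koboldcpp emulates, and it measures end-to-end wall-clock time from the client side; the detailed processing/generation split still only shows up in the koboldcpp console.

```python
import json
import time
import urllib.request

URL = "http://localhost:5001/api/v1/generate"  # default koboldcpp port
PROMPT = "Niko the kobold stalked carefully down the alley,"  # any fixed test prompt
RUNS = 5

timings = []
for i in range(RUNS):
    payload = json.dumps({"prompt": PROMPT, "max_length": 100}).encode("utf-8")
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)  # generated text is discarded; only the timing matters here
    timings.append(time.perf_counter() - start)
    print(f"run {i + 1}: {timings[-1]:.2f}s")

print(f"average over {RUNS} runs: {sum(timings) / RUNS:.2f}s")
```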
-
Finally I can start to utilize my AMD GPU... └(^o^)┐
Specs
Models
--gpulayers --useclblast 0 0 --blasbatchsize 1024 --threads 8 --smartcontext
Results
7B without CLBlast
13B with CLBlast GPU offloading
13B without CLBlast
\(^O^)/
-
I think something is wrong in v1.23.1. In Instruct mode, click on Memory, change it to whatever you want, then submit a test instruction. Check the command prompt: you will see that your "memory" is there together with the default memory instruction right after it, instead of overwriting the default. So I set the memory to "test" and sent a simple instruction, "hi". Here's my command prompt:
I'm pretty sure the previous versions didn't do this, just FYI. I could be wrong, and I'd have to double-check to be sure, but I figured I'd let you know. Also, I'm not sure which version this started in; I only happened to notice it just now.
-
koboldcpp-1.23beta
A.K.A. The "Is Pepsi Okay?" edition.
Changes:
--useclblast, combine with --gpulayers to pick number of layers to offload.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
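For those who go the zip route, a minimal launch sketch via the script rather than the .exe (this assumes the bundled libraries sit next to koboldcpp.py; the model filename and layer count are placeholders):
python koboldcpp.py --useclblast 0 0 --gpulayers 14 your-model.bin
Once loaded, connect to http://localhost:5001 as above.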
This discussion was created from the release koboldcpp-1.23beta.