Releases: LostRuins/koboldcpp

koboldcpp-1.26

27 May 10:06

KoboldCpp Changes:

  • NEW! You can now view Token Probabilities when using --debugmode. When enabled, for every generated token the console will display the probabilities of up to 4 alternative possible tokens, which is a good way to gauge how biased, confident, or overtrained a model is. The percentage values shown are taken after all the samplers have been applied, so this is also a great way to evaluate your sampler configurations. --debugmode also displays the contents of your input and context, along with their token IDs. Note that --debugmode has a slight performance hit, so it is off by default.
  • NEW! The Top-A sampler has been added! This is my own implementation of a special Kobold-exclusive sampler that does not exist in the upstream llama.cpp repo. It reduces the randomness of the AI whenever the probability of one token is much higher than all the others, with a cutoff proportional to the squared softmax probability of the most probable token. Higher values have a stronger effect; set the value to 0 to disable it. (A minimal sketch of the idea follows this list.)
  • Added support for the Starcoder and Starcoder Chat models.
  • Cleaned up and slightly refactored the sampler code. EOS stop tokens should now work for all model types; use --unbantokens to enable them. Additionally, the left square bracket [ token is no longer banned by default, as modern models don't really need it and its token ID was inconsistent across architectures.
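
Below is a minimal Python sketch of the Top-A idea described above: candidates whose post-softmax probability falls below top_a times the square of the highest probability are dropped. This is only an illustration of the concept, not the actual KoboldCpp implementation, and the names and exact cutoff are assumptions.

```python
import numpy as np

def top_a_filter(probs, top_a):
    """Illustrative Top-A filter: drop tokens whose probability is far below
    that of the most likely token. probs is a 1-D array of softmax values."""
    if top_a <= 0:                                # 0 disables the effect
        return probs
    threshold = top_a * np.max(probs) ** 2        # cutoff scales with p_max squared
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()                      # renormalize the survivors

# One dominant token prunes weak alternatives aggressively:
print(top_a_filter(np.array([0.8, 0.1, 0.06, 0.04]), top_a=0.2))
```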

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.25.1

24 May 07:34

KoboldCpp Changes:

  • Added a new Failsafe mode, triggered by running with --noavx2 --noblas --nommap, which disables all CPU intrinsics, allowing even ancient devices with no AVX or SSE support to run KoboldCpp, though they will be extremely slow.
  • Fixed a bug in the GUI that selected noavx2 mode incorrectly.
  • Pulled new changes for other non-llama architectures. In particular, the GPT Tokenizer has been improved.
  • Added support for setting the sampler_seed via the /generate API. Please refer to the KoboldAI API documentation for details; a request sketch follows this list.
  • Pulled upstream fixes and enhancements, and compile fixes for other architectures.
  • Added more console logging in --debugmode which can now display the context token contents.
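
As an illustration of the sampler_seed addition, here is a minimal request sketch using Python's requests library. It assumes KoboldCpp is running locally on the default port and exposes the usual KoboldAI /api/v1/generate endpoint; the field values are placeholders.

```python
import requests

# Illustrative payload: sampler_seed should make generations reproducible for a
# fixed prompt and fixed sampler settings. The other fields are ordinary
# KoboldAI generation parameters.
payload = {
    "prompt": "Once upon a time,",
    "max_length": 80,
    "temperature": 0.7,
    "sampler_seed": 12345,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```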

Edit: v1.25.1

  • Changed the Python version used for pyinstaller from 3.9 to 3.8. Combined with a change in failsafe mode that avoids PrefetchVirtualMemory, failsafe mode should now work on Windows 7! To use it, run with --noavx2 --noblas --nommap and failsafe mode will trigger.
  • Upgraded CLBlast to 1.6

Kobold Lite UI Changes:

  • Kobold Lite UI now supports variable streaming lengths (defaults to 8 tokens); you can set this by adding ?streamamount=[value] to the URL after launching with --stream
  • Removed newlines from being automatically inserted at the very start of chat scenarios, and slightly adjusted the chat regex. (This change was later reverted as it was buggy.)
  • Removed the default Alpaca instruction prompt, as it was less useful on newer instruct models. You can still use it by adding it to Memory.
  • Fixed an autosave bug which happened sometimes when disconnecting while using Lite.
  • Greatly improved markdown support
  • Added drag and drop file loading functionality

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.24

21 May 12:17

A.K.A The "He can't keep getting away with it!" edition.

KoboldCpp Changes:

  • Added support for the new GGJT v3 (q4_0, q4_1 and q8_0) quantization format changes.
  • Still retains backwards compatibility with every single historical GGML format (GGML, GGHF, GGJT v1,2,3 + all other formats from supported architectures).
  • Fixed F16 format detection in NeoX, including a fix for use_parallel_residual.
  • Various small fixes and improvements, sync to upstream and updated Kobold Lite.

Embedded Kobold Lite has also been updated, with the following changes:

  • Improved the spinning circle waiting animation to use less processing.
  • Fixed a bug with stopping sequences when in streaming mode.
  • Added a toggle to avoid inserting newlines in Instruct mode (good for Pygmalion and OpenAssistant based instruct models).
  • Added a toggle to enable basic markdown in instruct mode (off by default).

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

EDIT: An alternative CUDA build has been provided by Henky for this version, to allow access to the latest quantizations for CUDA users. Do note that it only supports the latest version of LLAMA-based models. CUDA builds will still not be generated by default, and support for them will be limited.

koboldcpp-1.23.1

17 May 13:36

A.K.A The "Is Pepsi Okay?" edition.

Changes:

  • Integrated support for the new quantization formats for GPT-2, GPT-J and GPT-NeoX
  • Integrated Experimental OpenCL GPU Offloading via CLBlast (Credits to @0cc4m)
    • You can only use this in combination with --useclblast; combine it with --gpulayers to pick the number of layers to offload (see the example command after this list)
    • Currently works for new quantization formats of LLAMA models only
    • Should work on all GPUs
  • Still supports all older GGML models; however, they will not be able to enjoy the new features.
  • Updated Lite, integrated various fixes and improvements from upstream.
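
For example, a launch with GPU offloading enabled might look like the line below. The platform/device indices for --useclblast, the layer count, and the model filename are placeholders; adjust them for your own hardware and model.

```
koboldcpp.exe --useclblast 0 0 --gpulayers 24 your_model.ggml.bin
```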

1.23.1 Edit:

  • Pulled Occam's fix for the q8 dequant kernels, so now q8 formats can enjoy GPU offloading as well.
  • Disabled fp16 prompt processing as it appears to be slower. Please compare!

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

Please share your performance benchmarks for CLBlast GPU offloading, or report any issues, here: #179. Do include whether your GPU supports F16.

koboldcpp-1.22-CUDA-ONLY

15 May 16:04

koboldcpp-1.22-CUDA-ONLY (Special Edition)

A.K.A The "Look what you made me do" edition.

Changes:

  • This is a (one-off?) limited edition CUDA only build.
  • Only NVIDIA GPUs will work for this.
  • This build does not support CLBlast or OpenBLAS. Selecting the OpenBLAS or CLBlast options will still load CUBLAS.
  • This build does not support running old quantization formats (this is a limitation of the upstream CUDA kernel).
  • This build DOES support GPU Offloading via CUBLAS. To use that feature, select number of layers to offload e.g. --gpulayers 32
  • This build is very large because of the CUBLAS libraries bundled with it. It requires CUDA Runtime support for 11.8 and up.

For those who want the previous version, please find v1.21.3 here: https://github.com/LostRuins/koboldcpp/releases/tag/v1.21.3

To use, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.21.3

13 May 05:06

KNOWN ISSUES: PLEASE READ

  • If you are using v1.21.1 or v1.21.0, there is a misalignment with one of the structs which can cause some models to randomly output nonsense. Please update to v1.21.2.
  • CLBlast seems to be broken on q8_0 formats in v1.21.0 to v1.21.2. Please update to v1.21.3.

Changes:

  • Integrated the new quantization formats while maintaining backward compatibility for all older ggml model formats. This was a massive undertaking and it's possible there may be bugs, so please do let me know if anything is broken!
  • Fixed some rare out of memory errors that occurred when using GPT2 models with BLAS.
  • Updated Kobold Lite: New features include multicolor names, idle chat responses, toggle for the instruct prompts, and various minor fixes.

1.21.1 edit:

  • Cleaned up some unnecessary prints regarding the BOS first token. Added an info message encouraging OSX users to use Accelerate instead of OpenBLAS (run with --noblas), since it's usually faster.

1.21.2 edit:

  • Fixed an error where the OpenCL kernel failed to compile on certain platforms. Please help check.
  • Fixed a problem where logits would sometimes be NaN, due to an unhandled change in the size of the Q8_1 struct. This also affected other formats such as NeoX, RedPajama and GPT2, so you are recommended to upgrade to 1.21.2.

1.21.3 edit:

  • Recognize q8_0 as an older format, as the new CLBlast kernel doesn't work correctly with it.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.20

08 May 13:19

  • Added an option to allocate more RAM for massive context sizes, to allow testing with models with > 2048 context. You can change this with the --contextsize flag (see the example launch below).
  • Added experimental support for the new RedPajama variant of GPT-NeoX models. As the model formats are nearly identical to Pythia, this was particularly tricky to implement. This uses a very ugly hack to determine whether it's a RedPajama model. If detection fails, you can always force it with the flag --forceversion
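
For instance, a hypothetical launch that reserves buffers for a larger context might look like the following; the context value and model filename are placeholders only.

```
koboldcpp.exe --contextsize 4096 your_model.ggml.bin
```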

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.

koboldcpp-1.19.1

06 May 04:12

koboldcpp-1.19

  • Integrated the --usemirostat option for all model types. This must be set at launch, and it replaces your normal stochastic samplers with Mirostat. It takes 3 params [type][tau][eta], e.g. --usemirostat 2 5.0 0.1. Works on all models, but is noticeably bad on smaller ones. Follows the upstream implementation; a rough sketch of the idea is shown after this list.

  • Added an option --forceversion [ver]. If the model file format detection fails (e.g. a rogue modified model), you can set this to override the detected format (enter the desired version, e.g. 401 for GPTNeoX-Type2).

  • Added an option --blasthreads, which controls the thread count used when CLBlast is active. Some people got overall speedups by using a different thread count while CLBlast was active, so now you can experiment. Uses the same value as --threads if not specified.

  • Integrated new improvements for RWKV. This provides support for all the new RWKV quantizations, but drops support for Q4_1_O following the upstream - this way I only need to maintain one library. RWKV q5_1 should be much faster than fp16 but perform similarly.

  • Bumped up the buffer size slightly to support Chinese alpaca.

  • Integrated upstream changes and improvements, various small fixes and optimizations.

  • Fixed a bug where the GPU device was set incorrectly in CLBlast.

  • Special: An experimental Windows 7 Compatible .exe is included for this release, to attempt to provide support for older OS. Let me know if it works (for those still stuck on Win7). Don't expect it to be in every release though.
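
For those curious how Mirostat differs from the usual samplers, here is a rough Python sketch of the Mirostat v2 idea: it truncates tokens whose surprise exceeds a running threshold mu, then nudges mu toward the target tau after each pick. This follows the published algorithm as I understand it, not KoboldCpp's exact code, and the variable names are illustrative.

```python
import numpy as np

def mirostat_v2_step(probs, mu, tau, eta, rng=np.random.default_rng()):
    """One illustrative Mirostat v2 sampling step.
    probs: softmax probabilities for the next token (1-D array)
    mu:    running truncation threshold (commonly started near 2 * tau)
    tau:   target surprise (the tau parameter of --usemirostat)
    eta:   learning rate (the eta parameter of --usemirostat)
    Returns (token_id, updated_mu)."""
    surprise = -np.log2(np.maximum(probs, 1e-12))  # per-token surprise in bits
    allowed = surprise <= mu                       # drop overly surprising tokens
    if not allowed.any():                          # always keep at least the top token
        allowed[np.argmax(probs)] = True
    kept = np.where(allowed, probs, 0.0)
    kept /= kept.sum()                             # renormalize the survivors
    token = rng.choice(len(probs), p=kept)         # sample from the truncated set
    observed = -np.log2(max(probs[token], 1e-12))  # surprise of the chosen token
    mu -= eta * (observed - tau)                   # steer mu toward the target tau
    return token, mu
```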

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.18

02 May 14:52

  • This release brings a new feature within Kobold Lite: Group Conversations. In chat mode, you can now specify multiple Chat Opponents (delimited with ||$||), which will trigger a simulated group chat, allowing the AI to reply as different people. Note that this does not work very well with Pygmalion models, as they were trained mainly on 1-to-1 chat; however, it seems to work well with LLAMA-based models. Each chat opponent adds a custom stopping sequence (max 10). Works best with Multiline Replies disabled. To demonstrate this, a new scenario, Class Reunion, has been added in Kobold Lite.
  • Added a new flag --highpriority, which increases the CPU priority of the process, potentially speeding up generation timings. See #133; your mileage may vary depending on memory bottlenecks. Do share if you experience significant speedups.
  • Added the --usemlock parameter to keep model in RAM, for Apple M1 users.
  • Fixed a stop_sequence bug which caused a crash
  • Added error information display when the tkinter GUI fails to load
  • Pulled upstream changes and fixes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.

koboldcpp-1.17

01 May 16:57

  • Removed Cloudflare Insights - this was previously in Kobold Lite and was included in KoboldCpp. For disclosure: Cloudflare Insights is a GDPR-compliant tool that Kobold Lite previously used to provide information on browser and platform distribution (e.g. the ratio of desktop/mobile users) and browser type (Chrome/Firefox etc.), to determine which browser platforms I have to support for Kobold Lite. You can read more about it here: https://www.cloudflare.com/insights/ It did not track any personal information, and did not relay any data you load, use, enter or access within Kobold. It was not intended to be included in KoboldCpp, and I originally removed it but forgot to do so for subsequent versions. As of this version, it is removed from both Kobold Lite and KoboldCpp by request.

  • Added the Token Unbanning option to the UI; it stops the EOS token from being banned, which is required for newer Pygmalion models. You can trigger it with --unbantokens

  • Pulled upstream fixes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program with the --help flag.