Releases: LostRuins/koboldcpp

koboldcpp-1.46.1

08 Oct 07:03

Important: Deprecation Notice for KoboldCpp 1.46

  • The following command line arguments are deprecated and have been removed from this version onward (a request sketch showing the replacement JSON fields follows this list):
    - --psutil_set_threads - removed as it's now generally unhelpful; the defaults are usually sufficient.
    - --stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
    - --unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
    - --usemirostat - Mirostat values should only be set via the generate API, in the mirostat, mirostat_tau and mirostat_eta json fields.
  • Removed the original deprecated tkinter GUI, now only the new customtkinter GUI remains.
  • Improved the embedded horde worker: added even more session stats, and job pulls and job submits are now done in parallel, so it should run about 20% faster for horde requests.
  • Changed the default model name from concedo/koboldcpp to koboldcpp/[model_filename]. This does prevent old KoboldAI Client users from connecting via the API, so if you're still using that, either switch to a newer client or connect via the Basic/OpenAI API instead of the Kobold API.
  • Added proper API documentation, which can be found by navigating to /api, or via the web version at https://lite.koboldai.net/koboldcpp_api
  • Allowed .kcpps files to be drag & dropped, and to be opened via Open With on Windows.
  • Added a new OpenAI Chat Completions compatible endpoint at /v1/chat/completions (credit: @teddybear082); a minimal request sketch appears further below.
  • --onready processes are now started with subprocess.run instead of Popen (#462)
  • Both /check and /abort can now function together with multiuser mode, provided the correct genkey is used by the client (automatically handled in Lite).
  • Allow 64k --contextsize (for GGUF only, still 16k otherwise).
  • Minor UI fixes and enhancements.
  • Updated Lite, pulled fixes and improvements from upstream.
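
As a rough illustration of the replacement for --unbantokens and --usemirostat, here is a minimal sketch of a Kobold generate API request that sets those values per-request. Only the field names listed above come from these notes; the endpoint path, the other payload fields, and the response shape shown are the usual Kobold API ones and are included purely for context.

    import json, urllib.request

    # Minimal sketch: EOS unbanning and mirostat are now per-request JSON fields
    # on the generate API rather than launch flags.
    payload = {
        "prompt": "Once upon a time,",
        "max_length": 64,                   # generation length (illustrative)
        "use_default_badwordsids": False,   # False = EOS token allowed (replaces --unbantokens)
        "mirostat": 2,                      # mirostat mode (replaces --usemirostat)
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1,
    }
    req = urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["results"][0]["text"])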

v1.46.1 hotfix: fixed an issue where the blasthreads value was used for batches of between 1 and 32 tokens.
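
Relatedly, a minimal sketch of a request to the new /v1/chat/completions endpoint mentioned above, assuming the standard OpenAI Chat Completions request and response shape; the model value is just a placeholder for a local server.

    import json, urllib.request

    # Standard OpenAI-style Chat Completions payload; which optional fields are
    # honored is up to the server, so only the basics are shown here.
    payload = {
        "model": "koboldcpp",  # placeholder; the locally loaded model is used
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])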

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm from YellowRoseCx's fork.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.45.2

01 Oct 07:39

  • Improved embedded horde worker: more responsive, and added Session Stats (Total Kudos Earned, EarnRate, Timings)
  • Added a new parameter to the grammar sampler API, grammar_retain_state, which lets you persist the grammar state across multiple requests (a request sketch follows this list).
  • Allowed launching by picking a .kcpps file in the file selector GUI combined with --skiplauncher. That settings file must already have a model selected. (Similar to --config, but that one doesn't use the GUI at all.)
  • Added a new flag toggle --foreground for windows users. This sends the console terminal to the foreground every time a new prompt is generated, to avoid some idling slowdown issues.
  • Increased the maximum supported context with --contextsize to 32k, but only for GGUF models. It's still limited to 16k for older model versions. GGUF now actually has no hard limit on max context since it switched to using allocators, but this is not compatible with older models. Additionally, models not trained with extended context are unlikely to work when RoPE scaled beyond 32k.
  • Added a simple OpenAI compatible completions API, which you can access at /v1/completions (a request sketch appears after the hotfix note below). You're still recommended to use the Kobold API as it has many more settings.
  • Increased stop_sequence limit to 16.
  • Improved SSE streaming by batching pending tokens between events.
  • Upgraded Lite polled-streaming to work even in multiuser mode. This works by sending a unique key for each request.
  • Improved Makefile to reduce unnecessary builds, added flag for skipping K-quants.
  • Enhanced Remote-Link.cmd to also work on Linux; simply run it to create a Cloudflare tunnel for accessing koboldcpp from anywhere.
  • Improved the default colab notebook to use mmq.
  • Updated Lite and pulled other fixes and improvements from upstream llama.cpp.
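
As a rough sketch of the new grammar_retain_state parameter mentioned above: a generate request that passes a grammar and asks for its state to be kept for the next request. The GBNF string, the grammar field name, and the surrounding payload are illustrative assumptions; grammar_retain_state is the field this release adds.

    import json, urllib.request

    # Illustrative generate request that retains grammar state across calls.
    # "grammar" holds a GBNF grammar string (field name assumed); the new
    # grammar_retain_state flag asks the server to persist its state.
    payload = {
        "prompt": "Pick a color:",
        "max_length": 16,
        "grammar": 'root ::= "red" | "green" | "blue"',
        "grammar_retain_state": True,
    }
    req = urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["results"][0]["text"])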

Important: Deprecation Notice for KoboldCpp 1.45.1

The following command line arguments are considered deprecated and will be removed soon, in a future version.

--psutil_set_threads - this parameter will be removed as it's now generally unhelpful; the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat, mirostat_tau and mirostat_eta json fields.

Hotfix for 1.45.2 - Fixed a bug with reading thread counts in 1.45 and 1.45.1, and also moved the OpenAI endpoint from /api/extra/oai/v1/completions to just /v1/completions.
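
For reference, a minimal request sketch for the relocated /v1/completions endpoint, assuming the standard OpenAI completions schema (the Kobold API remains the more fully featured option):

    import json, urllib.request

    # Minimal OpenAI-style completions request; only basic fields are shown,
    # as support for the optional ones is not guaranteed.
    payload = {"prompt": "The quick brown fox", "max_tokens": 32}
    req = urllib.request.Request(
        "http://localhost:5001/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["text"])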

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm from YellowRoseCx's fork.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.44.2

20 Sep 10:27

A.K.A. the "Mom: we have SillyTavern at home" edition

  • Added multi-user mode with --multiuser, which allows up to 5 concurrent incoming /generate requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the /check and /abort endpoints are inactive while multiple requests are in the queue; this is to prevent one user from accidentally reading or cancelling a different user's request.
  • Added a new launcher argument --onready which allows you to pass a terminal command (e.g. start a python script) to be executed after Koboldcpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs etc.
  • Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
  • Added a new API endpoint /api/extra/true_max_context_length which allows fetching the true max context limit, separate from the horde-friendly value (see the sketch after this list).
  • Added support for selecting a 4th GPU from the UI and command line (the previous max was 3).
  • Tweaked automatic RoPE scaling
  • Pulled other fixes and improvements from upstream.
  • Note: Using --usecublas with the prebuilt Windows executables here is only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx's koboldcpp-rocm fork instead.
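
A quick sketch of querying the new /api/extra/true_max_context_length endpoint; the exact response shape is not documented in these notes, so it is simply printed.

    import json, urllib.request

    # Fetch the true max context limit, separate from the horde-friendly value.
    url = "http://localhost:5001/api/extra/true_max_context_length"
    with urllib.request.urlopen(url) as resp:
        print(json.loads(resp.read()))  # a small JSON object containing the limit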

Major Update for Kobold Lite:

  • Kobold Lite has undergone a massive overhaul, renamed and rearranged elements for a cleaner UI.
  • Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
  • Added Mirostat UI configs to settings panel.
  • Allowed Idle Responses in all modes; it is now a global setting. Also fixed an idle response detection bug.
  • Smarter group chats: mentioning a specific name inside a group chat will cause that character to respond, instead of a random one.
  • Added support for automagically increasing the max context size slider limit, if a larger context is detected.
  • Added scenario for importing characters from Chub.Ai
  • Added a settings checkbox to enable streaming whenever applicable, without having to mess with URLs. Streaming can now be easily toggled from the settings UI, similar to EOS unbanning, although the --stream flag is still kept for compatibility.
  • Added a few Instruct Tag Presets in a dropdown.
  • Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle option to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like {{[INPUT]}} and {{[OUTPUT]}}
  • Added a toggle for "Newline After Memory" which can be set in the memory panel.
  • Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
  • You can specify a GBNF grammar string in settings to use when generating; this controls grammar sampling.
  • Various minor bugfixes, also fixed stop_sequences still appearing in the AI outputs, they should be correctly truncated now.

v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm from YellowRoseCx's fork.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.43

07 Sep 09:04

  • Re-added support for automatic rope scale calculations based on a model's training context (n_ctx_train), this triggers if you do not explicitly specify a --ropeconfig. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified --contextsize. Setting --ropeconfig will override this. This was bugged and removed in the previous release, but it should be working fine now.
  • HIP and CUDA visible devices are now set to that GPU only, if a GPU number is provided and tensor split is not specified.
  • Fixed RWKV models being broken after recent upgrades.
  • Tweaked --unbantokens to decrease the banned token logit values further, as very rarely they could still appear. Still not using -inf as that causes issues with typical sampling.
  • Integrated SSE streaming improvements from @kalomaze
  • Added mutex for thread-safe polled-streaming from @Elbios
  • Added support for older GGML (ggjt_v3) for 34B llama2 models by @vxiiduu, note that this may still have issues if n_gqa is not 1, in which case using GGUF would be better.
  • Fixed support for Windows 7, which should work in noavx2 and failsafe modes again. Also, SSE3 flags are now enabled for failsafe mode.
  • Updated Kobold Lite, now uses placeholders for instruct tags that get swapped during generation.
  • Tab navigation order improved in the GUI launcher, though some elements like checkboxes still require the mouse to toggle.
  • Pulled other fixes and improvements from upstream.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

Of Note:

  • Reminder that HIPBLAS requires self compilation, and is not included by default in the prebuilt executables.
  • Remember that token unbans can now be set via API (and Lite) in addition to the command line.

koboldcpp-1.42.1

30 Aug 15:23

  • Added support for LLAMA GGUFv2 models, handled automatically. All older models will still continue to work normally.
  • Fixed a problem with certain logit values that were causing segfaults when using the Typical sampler. Please let me know if it happens again.
  • Merged rocm support from @YellowRoseCx so you should now be able to build AMD compatible GPU builds with HIPBLAS, which should be faster than using CLBlast.
  • Merged upstream support for GGUF Falcon models. Note that GPU layer offload for Falcon is unavailable with --useclblast but works with CUDA. Older pre-gguf Falcon models are not supported.
  • Added support for unbanning EOS tokens directly from API, and by extension it can now be triggered from Lite UI settings. Note: Your command line --unbantokens flag will force override this.
  • Added support for automatic rope scale calculations based on a model's training context (n_ctx_train); this triggers if you do not explicitly specify a --ropeconfig. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified --contextsize. Setting --ropeconfig will override this. (Reverted in 1.42.1 for now, as it was not set up correctly.)
  • Updated Kobold Lite, now with tavern style portraits in Aesthetic Instruct mode.
  • Pulled other fixes and improvements from upstream.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.41 (beta)

24 Aug 13:51

It's been a while since the last release and quite a lot upstream has changed under the hood, so consider this release a beta.

  • Added support for LLAMA GGUF models, handled automatically. All older models will still continue to work normally. Note that GGUF format support for other non-llama architectures has not been added yet.
  • Added --config flag to load a .kcpps settings file when launching from command line (Credits: @poppeman), these files can also be imported/exported from the GUI.
  • Added a new endpoint /api/extra/tokencount which can be used to tokenize and accurately measure how many tokens any string has (a request sketch follows this list).
  • Fix for bell characters occasionally causing the terminal to beep in debug mode.
  • Fix for incorrect list of backends & missing backends displayed in the GUI.
  • Set MMQ to be the default for CUDA when running from GUI.
  • Updated Lite, and merged all the improvements and fixes from upstream.
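
A short sketch of the new /api/extra/tokencount endpoint; the request and response field names shown are assumptions for illustration, so check the server's behavior if they differ.

    import json, urllib.request

    # Tokenize a string and count its tokens. The "prompt" request field and
    # the response shape are assumptions.
    payload = {"prompt": "How many tokens is this sentence?"}
    req = urllib.request.Request(
        "http://localhost:5001/api/extra/tokencount",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # expected to include the token count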

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.40.1

09 Aug 04:55

This release is mostly for bugfixes to the previous one, but enough small stuff has changed that I chose to make it a new version instead of a patch for the previous one.

  • Fixed a regression in format detection for LLAMA 70B.
  • Converted the embedded horde worker to run in daemon mode, which hopefully solves the occasional exceptions.
  • Fixed some OOMs for blasbatchsize 2048, adjusted buffer sizes
  • Slight modification to the look ahead (2 to 5%) for the cuda pool malloc.
  • Pulled some bugfixes from upstream
  • Added a new field, idle, to the /api/extra/perf endpoint, which allows checking whether a generation is in progress without sending one (see the sketch after this list).
  • Fixed cmake compilation for cudatoolkit 12.
  • Updated Lite, includes option for aesthetic instruct UI (early beta by @Lyrcaxis, please send them your feedback)
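
A minimal sketch of polling the perf endpoint for the new idle field; the truthiness convention of idle is an assumption here.

    import json, urllib.request

    # Check whether a generation is currently in progress without submitting one.
    with urllib.request.urlopen("http://localhost:5001/api/extra/perf") as resp:
        perf = json.loads(resp.read())
    print("idle" if perf.get("idle") else "busy")  # idle field added in this release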

hotfix 1.40.1:

  • Added handling for stablecode-completion-alpha-3b

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.39.1

07 Aug 07:36

  • Fix SSE streaming to handle headers correctly during abort (Credits: @duncannah)
  • Bugfix for --blasbatchsize -1 and 1024 (fix alloc blocks error)
  • Added experimental support for --blasbatchsize 2048 (note, buffers are doubled if that is selected, using much more memory)
  • Added support for 12k and 16k --contextsize options. Please let me know if you encounter issues.
  • Pulled upstream improvements, further CUDA speedups for MMQ mode for all quant types.
  • Fix for some LLAMA 65B models being detected as LLAMA2 70B models.
  • Reverted to the upstream approach for CUDA pool malloc (in 1.39.1, done only for MMQ).
  • Updated Lite; includes support for importing Tavern V2 card formats with world info (character book), and clearer settings edit boxes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.38

02 Aug 15:14

  • Added upstream support for Quantized MatMul (MMQ) prompt processing, a new option for CUDA (enabled by adding --usecublas mmq or toggle in GUI). This uses slightly less memory, and is slightly faster for Q4_0 but slower for K-quants.
  • Fixed SSE streaming for multibyte characters (For Tavern compatibility)
  • --noavx2 mode now does not use OpenBLAS (same as Failsafe), this is due to numerous compatibility complaints.
  • GUI dropdown preset only displays built platforms (Credit: @YellowRoseCx)
  • Added a Help button in the GUI
  • Fixed an issue with mirostat not reading correct value from GUI
  • Fixed an issue with context size slider being limited to 4096 in the GUI
  • Displays a terminal warning if the received context exceeds the maximum context allocated at launch.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.37.1

24 Jul 15:22

  • NEW: KoboldCpp now comes with an embedded Horde Worker, which allows anyone to share their ggml models with the AI Horde without downloading additional dependencies. --hordeconfig now accepts 5 parameters [hordemodelname] [hordegenlength] [hordemaxctx] [hordeapikey] [hordeworkername]; filling in all 5 will start a Horde worker for you that serves horde requests automatically in the background. For the previous behavior, exclude the last 2 parameters to continue using your own Horde worker (e.g. HaidraScribe/KAIHordeBridge). This feature can also be enabled via the GUI.
  • Added support for LLAMA2 70B models. This should work automatically; GQA will be set to 8 if it's detected.
  • Fixed a bug with mirostat v2 that was causing overly deterministic results. Please try it again. (Credit: @ycros)
  • Added additional information to /api/extra/perf for the last generation, including the stopping reason as well as generated token counts.
  • Exposed the parameter for --tensor_split which works exactly like it does upstream. Only for CUDA.
  • Tried to support Kepler as a CUDA target as well, on henky's suggestion; can't guarantee it will work as I don't have a K80, but it might.
  • Retained support for --blasbatchsize 1024 after it was removed upstream. Scratch & KV buffer sizes will be larger when using this.
  • Minor bugfixes, pulled other upstream fixes and optimizations, updated Kobold Lite (chat mode improvements)

Hotfix 1.37.1

  • Fixed clblast to work correctly for LLAMA2 70B
  • Fixed sending Client-Agent for embedded horde worker in addition to Bridge Agent and User Agent
  • Changed rms_norm_eps to 5e-6 for better results for both llama1 and 2
  • Fixed some streaming bugs in Lite

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.