Add Llamacpp backend (clp-research#81)
* Add context limit check function to backends/util.py
* Add model entries to registry
* Add handling of optional model loading flags for CPU/GPU usage and GPU layer offload
* Add openchat_3.5-GGUF-q5 to model registry
* Add llama.cpp backend howto

(cherry picked from commit 94493c2)
Gnurro committed Apr 25, 2024
1 parent 42ee939 commit d7e7d00
Showing 4 changed files with 41 additions and 6 deletions.
2 changes: 1 addition & 1 deletion backends/llamacpp_api.py
@@ -195,4 +195,4 @@ def generate_response(self, messages: List[Dict], return_full_text: bool = False
else:
response_text = prompt_text + model_output['choices'][0]['text'].strip()

return prompt, response, response_text
2 changes: 1 addition & 1 deletion backends/utils.py
@@ -87,4 +87,4 @@ def check_context_limit_generic(context_size: int, prompt_tokens: List, model_na
raise ContextExceededError(f"Context token limit for {model_name} exceeded",
tokens_used=tokens_used, tokens_left=tokens_left, context_size=context_size)

return fits, tokens_used, tokens_left, context_size
17 changes: 13 additions & 4 deletions docs/howto_use_llama-cpp_backend.md
@@ -6,21 +6,30 @@ hardware backend and operating system, and models may need to be loaded with spe
[Setup](#setup)
[Model loading](#model-loading)
## Setup
The clembench llama.cpp backend relies on the llama-cpp-python library, which wraps the C++ llama.cpp library. To make use of
specific hardware, especially GPUs, the installation must include a version of llama.cpp built for that hardware. This may entail
compiling llama.cpp, but pre-compiled versions for common hardware are available.
Since this is specific to the available hardware, please refer to the [llama-cpp-python installation instructions](https://llama-cpp-python.readthedocs.io/en/latest/#installation)
to install the library. It is recommended to use one of the pre-built wheels for the available hardware, as this requires neither a C++ compiler
nor compiling llama.cpp during the installation.
### Sample setup script
The following example shell script installs the clembench llama.cpp backend with support for CUDA 12.2 GPUs:
```shell
# create separate venv for running the llama.cpp backend:
python3 -m venv venv_llamacpp
source venv_llamacpp/bin/activate
# install basic clembench requirements:
pip3 install -r requirements.txt
# install llama-cpp-python using pre-built wheel with CUDA 12.2 support:
pip3 install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
```
## Model loading
The clembench llama.cpp backend downloads model files from HuggingFace model repositories. See the [model registry readme](model_backend_registry_readme.md).
By default, the clembench llama.cpp backend loads all model layers onto the available GPU(s). This requires that
llama.cpp was installed with GPU support fitting the system hardware during setup.
Optionally, models can be loaded to run on CPU (using RAM instead of GPU VRAM). This is required if llama-cpp-python was
installed without GPU support. This can be done by passing a JSON object to the clembench CLI scripts, or a Python `dict`
to the model loading function of the clembench `backends`.
The JSON object/`dict` has to contain the model name as defined in the [model registry](model_backend_registry_readme.md)
and the key `execute_on` with string value `gpu` or `cpu`:
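A minimal sketch of such a loading call is shown below; the `backends.get_model_for` loader name is an assumption about the clembench `backends` API, and the model name is the registry entry added by this commit:
```python
# Sketch only: loading a llama.cpp registry model via the clembench backends.
# Assumption: a loader function like backends.get_model_for exists and accepts a dict.
import backends

cpu_spec = {
    "model_name": "openchat_3.5-GGUF-q5",  # name as defined in the model registry
    "execute_on": "cpu"                    # load on CPU, using RAM instead of GPU VRAM
}
model = backends.get_model_for(cpu_spec)

# Without the "execute_on" key, all model layers are loaded onto the available GPU(s).
gpu_model = backends.get_model_for({"model_name": "openchat_3.5-GGUF-q5"})
```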
26 changes: 26 additions & 0 deletions docs/model_backend_registry_readme.md
@@ -17,6 +17,32 @@ The following key/values are **optional**, but should be defined for models that
`custom_chat_template`(string): A jinja2 template string of the chat template to be applied for this model. This should be set if `premade_chat_template` is `false` for the model, as the generic fallback chat template that will be used if this is not defined is likely to lead to bad model performance.
`slow_tokenizer`(bool): If `true`, the backend will load the model's tokenizer with `use_fast=False`. Some models require the use of a 'slow' tokenizer class to assure proper tokenization.
`output_split_prefix`(string): The model's raw output will be rsplit using this string, and the remaining output following this string will be considered the model output. This is necessary for some models that decode tokens differently than they encode them, to assure that the prompt is properly removed from model responses. Example: `assistant\n`
### llama.cpp Backend
This backend requires these **mandatory** key/values:
`huggingface_id`(string): The full huggingface model ID, i.e. huggingface user name / model name. Example: `TheBloke/openchat_3.5-GGUF`
`filename`(string): A wildcard (glob) pattern used to select and download the model file for a specific
quantization/version of the model from the HuggingFace repository. It is case-sensitive. Please check the repository
defined in `huggingface_id` for the proper file name. Example: `*Q5_0.gguf` for the Q5_0-quantized version in the
`TheBloke/openchat_3.5-GGUF` repository.
`premade_chat_template`(bool): If `true`, the chat template that is applied for generation is loaded from the model
repository on huggingface. If `false`, the value of `custom_chat_template` will be used if defined, otherwise a generic
chat template is applied (highly discouraged).
`eos_to_cull`(string): This is the string representation of the model's EOS token. It needs to be removed by the backend to assure proper processing by clembench. Example: `<|im_end|>` (This is mandatory as there are models that do not define this in their tokenizer configuration.)
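For illustration, an entry using these mandatory keys might look like the following sketch (shown as a Python `dict`; keys required for all backends and documented earlier in this readme are omitted, and the `premade_chat_template` value is only an assumption for this model):
```python
# Illustrative llama.cpp entry built from the examples above (not an authoritative registry excerpt).
llamacpp_entry = {
    "huggingface_id": "TheBloke/openchat_3.5-GGUF",  # huggingface user name / model name
    "filename": "*Q5_0.gguf",                        # selects the Q5_0 model file in the repository
    "premade_chat_template": True,                   # assumption: use the template shipped with the repository
    "eos_to_cull": "<|im_end|>"                      # EOS string removed from raw model output
}
```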

The following key/values are **optional**, but should be defined for models that require them for proper functioning:
`requires_api_key`(bool): If `true`, the backend will load a huggingface api access key/token from `key.json`, which is required to access 'gated' models like Meta's Llama2.
`custom_chat_template`(string): A jinja2 template string of the chat template to be applied for this model. This should be set if `premade_chat_template` is `false` for the model, as the generic fallback chat template that will be used if this is not defined is likely to lead to bad model performance.
`bos_string` (string): In case the model file does not contain a predefined BOS token, this string will be used to
create the logged input prompt.
`eos_string` (string): In case the model file does not contain a predefined EOS token, this string will be used to
create the logged input prompt.
`output_split_prefix`(string): The model's raw output will be rsplit using this string, and the remaining output following this string will be considered the model output. This is necessary for some models that decode tokens differently than they encode them, to assure that the prompt is properly removed from model responses. Example: `assistant\n`
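As a sketch, an entry that needs these optional keys would add them alongside the mandatory keys above; all values below are purely illustrative placeholders:
```python
# Purely illustrative optional keys; the BOS/EOS strings and the template are placeholders.
optional_keys = {
    "requires_api_key": True,              # read a huggingface access token from key.json
    "custom_chat_template": "{% for message in messages %}{{ message['content'] }}{% endfor %}",
    "bos_string": "<s>",                   # used for the logged prompt if the model file defines no BOS token
    "eos_string": "</s>",                  # used for the logged prompt if the model file defines no EOS token
    "output_split_prefix": "assistant\n"   # raw model output is rsplit on this prefix
}
```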
#### Advanced
It is recommended to use these key/values only in a custom registry file:
`execute_on` (string): Either `gpu`, to run the model with all layers loaded to GPU using VRAM, or `cpu` to run the model on CPU
only, using main RAM. `gpu` requires a llama.cpp installation with GPU support, `cpu` one with CPU support.
`gpu_layers_offloaded` (integer): The number of model layers to offload to GPU/VRAM. This requires a llama.cpp
installation with GPU support. This key is only used if there is no `execute_on` key in the model entry.
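For illustration, a custom registry entry could control hardware placement with either key, as in this sketch (the layer count is arbitrary):
```python
# Illustrative hardware-placement keys for a custom registry entry.
cpu_only = {"execute_on": "cpu"}                 # run entirely on CPU/main RAM
gpu_full = {"execute_on": "gpu"}                 # load all layers onto GPU/VRAM
partial_offload = {"gpu_layers_offloaded": 20}   # offload 20 layers to GPU; ignored if execute_on is set
```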
# Backend Classes
Model registry entries are mainly used for two classes: `backends.ModelSpec` and `backends.Model`.
## ModelSpec
