Add Llamacpp backend (clp-research#81)
* Add context limit check function to backends/util.py
* Add model entries to registry
* Add handling of optional model loading flags for CPU/GPU usage and GPU layer offload
* Add openchat_3.5-GGUF-q5 to model registry
* Add llama.cpp backend howto

(cherry picked from commit 94493c2)
Gnurro committed Apr 25, 2024
1 parent 42ee939 commit d7e7d00
Showing 4 changed files with 41 additions and 6 deletions.
2 changes: 1 addition & 1 deletion backends/llamacpp_api.py
@@ -195,4 +195,4 @@ def generate_response(self, messages: List[Dict], return_full_text: bool = False
else:
response_text = prompt_text + model_output['choices'][0]['text'].strip()

return prompt, response, response_text
2 changes: 1 addition & 1 deletion backends/utils.py
@@ -87,4 +87,4 @@ def check_context_limit_generic(context_size: int, prompt_tokens: List, model_na
raise ContextExceededError(f"Context token limit for {model_name} exceeded",
tokens_used=tokens_used, tokens_left=tokens_left, context_size=context_size)

return fits, tokens_used, tokens_left, context_size
17 changes: 13 additions & 4 deletions docs/howto_use_llama-cpp_backend.md
@@ -6,21 +6,30 @@ hardware backend and operating system, and models may need to be loaded with spe
[Setup](#setup)
[Model loading](#model-loading)
## Setup
The clembench llama.cpp backend relies on the llama-cpp-python library, which wraps the C++ llama.cpp library. To make use of
specific hardware, especially GPUs, the installation must include a version of llama.cpp built for that hardware. This may entail
compiling llama.cpp, but pre-compiled versions for common hardware are available.
Since this is specific to the available hardware, please refer to the [llama-cpp-python installation instructions](https://llama-cpp-python.readthedocs.io/en/latest/#installation)
to install the library. It is recommended to use one of the pre-built wheels for the available hardware, as this requires neither a C++ compiler
nor compiling llama.cpp during the installation.
### Sample setup script
The following example shell script installs the clembench llama.cpp backend with support for CUDA 12.2 GPUs:
```shell
# create separate venv for running the llama.cpp backend:
python3 -m venv venv_llamacpp
source venv_llamacpp/bin/activate
# install basic clembench requirements:
pip3 install -r requirements.txt
# install llama-cpp-python using pre-built wheel with CUDA 12.2 support:
pip3 install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
```
## Model loading
The clembench llama.cpp backend downloads model files from HuggingFace model repositories. See the [model registry readme](model_backend_registry_readme.md).
By default, the clembench llama.cpp backend loads all model layers onto the available GPU(s). This requires that
llama.cpp was installed with GPU support fitting the system hardware during setup.
Optionally, models can be loaded to run on CPU (using RAM instead of GPU VRAM). This is required if llama-cpp-python was
installed without GPU support. This can be done by passing a JSON object to the clembench CLI scripts, or a Python `dict`
to the model loading function of the clembench `backends`.
The JSON object/`dict` has to contain the model name as defined in the [model registry](model_backend_registry_readme.md)
and the key `execute_on` with string value `gpu` or `cpu`:
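A minimal sketch of such a loading call is shown below; the `backends.get_model_for` loader name is an assumption about the clembench `backends` API, and the model name is the registry entry added by this commit:
```python
# Sketch only: loading a llama.cpp registry model via the clembench backends.
# Assumption: a loader function like backends.get_model_for exists and accepts a dict.
import backends

cpu_spec = {
    "model_name": "openchat_3.5-GGUF-q5",  # name as defined in the model registry
    "execute_on": "cpu"                    # load on CPU, using RAM instead of GPU VRAM
}
model = backends.get_model_for(cpu_spec)

# Without the "execute_on" key, all model layers are loaded onto the available GPU(s).
gpu_model = backends.get_model_for({"model_name": "openchat_3.5-GGUF-q5"})
```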
26 changes: 26 additions & 0 deletions docs/model_backend_registry_readme.md
@@ -17,6 +17,32 @@ The following key/values are **optional**, but should be defined for models that
`custom_chat_template`(string): A jinja2 template string of the chat template to be applied for this model. This should be set if `premade_chat_template` is `false` for the model, as the generic fallback chat template that will be used if this is not defined is likely to lead to bad model performance.
`slow_tokenizer`(bool): If `true`, the backend will load the model's tokenizer with `use_fast=False`. Some models require the use of a 'slow' tokenizer class to assure proper tokenization.
`output_split_prefix`(string): The model's raw output will be rsplit using this string, and the remaining output following this string will be considered the model output. This is necessary for some models that decode tokens differently than they encode them, to assure that the prompt is properly removed from model responses. Example: `assistant\n`
### llama.cpp Backend
This backend requires these **mandatory** key/values:
`huggingface_id`(string): The full huggingface model ID, i.e. huggingface user name / model name. Example: `TheBloke/openchat_3.5-GGUF`
`filename`(string): A wildcard (glob) pattern used to select and download the model file for a specific
quantization/version of the model from the HuggingFace repository. It is case-sensitive. Please check the repository
defined in `huggingface_id` for the proper file name. Example: `*Q5_0.gguf` for the Q5_0-quantized version in the
`TheBloke/openchat_3.5-GGUF` repository.
`premade_chat_template`(bool): If `true`, the chat template that is applied for generation is loaded from the model
repository on huggingface. If `false`, the value of `custom_chat_template` will be used if defined, otherwise a generic
chat template is applied (highly discouraged).
`eos_to_cull`(string): This is the string representation of the model's EOS token. It needs to be removed by the backend to assure proper processing by clembench. Example: `<|im_end|>` (This is mandatory as there are models that do not define this in their tokenizer configuration.)
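For illustration, an entry using these mandatory keys might look like the following sketch (shown as a Python `dict`; keys required for all backends and documented earlier in this readme are omitted, and the `premade_chat_template` value is only an assumption for this model):
```python
# Illustrative llama.cpp entry built from the examples above (not an authoritative registry excerpt).
llamacpp_entry = {
    "huggingface_id": "TheBloke/openchat_3.5-GGUF",  # huggingface user name / model name
    "filename": "*Q5_0.gguf",                        # selects the Q5_0 model file in the repository
    "premade_chat_template": True,                   # assumption: use the template shipped with the repository
    "eos_to_cull": "<|im_end|>"                      # EOS string removed from raw model output
}
```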

The following key/values are **optional**, but should be defined for models that require them for proper functioning:
`requires_api_key`(bool): If `true`, the backend will load a huggingface api access key/token from `key.json`, which is required to access 'gated' models like Meta's Llama2.
`custom_chat_template`(string): A jinja2 template string of the chat template to be applied for this model. This should be set if `premade_chat_template` is `false` for the model, as the generic fallback chat template that will be used if this is not defined is likely to lead to bad model performance.
`bos_string` (string): In case the model file does not contain a predefined BOS token, this string will be used to
create the logged input prompt.
`eos_string` (string): In case the model file does not contain a predefined EOS token, this string will be used to
create the logged input prompt.
`output_split_prefix`(string): The model's raw output will be rsplit using this string, and the remaining output following this string will be considered the model output. This is necessary for some models that decode tokens differently than they encode them, to assure that the prompt is properly removed from model responses. Example: `assistant\n`
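As a sketch, an entry that needs these optional keys would add them alongside the mandatory keys above; all values below are purely illustrative placeholders:
```python
# Purely illustrative optional keys; the BOS/EOS strings and the template are placeholders.
optional_keys = {
    "requires_api_key": True,              # read a huggingface access token from key.json
    "custom_chat_template": "{% for message in messages %}{{ message['content'] }}{% endfor %}",
    "bos_string": "<s>",                   # used for the logged prompt if the model file defines no BOS token
    "eos_string": "</s>",                  # used for the logged prompt if the model file defines no EOS token
    "output_split_prefix": "assistant\n"   # raw model output is rsplit on this prefix
}
```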
#### Advanced
It is recommended to use these key/values only in a custom registry file:
`execute_on` (string): Either `gpu`, to run the model with all layers loaded to GPU using VRAM, or `cpu` to run the model on CPU
only, using main RAM. `gpu` requires a llama.cpp installation with GPU support, `cpu` one with CPU support.
`gpu_layers_offloaded` (integer): The number of model layers to offload to GPU/VRAM. This requires a llama.cpp
installation with GPU support. This key is only used if there is no `execute_on` key in the model entry.
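For illustration, a custom registry entry could control hardware placement with either key, as in this sketch (the layer count is arbitrary):
```python
# Illustrative hardware-placement keys for a custom registry entry.
cpu_only = {"execute_on": "cpu"}                 # run entirely on CPU/main RAM
gpu_full = {"execute_on": "gpu"}                 # load all layers onto GPU/VRAM
partial_offload = {"gpu_layers_offloaded": 20}   # offload 20 layers to GPU; ignored if execute_on is set
```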
# Backend Classes
Model registry entries are mainly used for two classes: `backends.ModelSpec` and `backends.Model`.
## ModelSpec
