From 564fa63a3ad5d72251f08a2f19580c42cd850b89 Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 18:32:12 +0200
Subject: [PATCH 1/3] Added commands for running the proxy with a local server

Added a section in `Installation` that lists the commands needed to run the proxy with a local server.
---
 README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/README.md b/README.md
index 34272f4..9d764b4 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,22 @@ python optillm.py
 2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit
 ```
 
+### Starting the optillm proxy for a local server (e.g. llama.cpp)
+
+- Set the `OPENAI_API_KEY` env variable to a placeholder value
+  - e.g. `export OPENAI_API_KEY="no_key"`
+- Run `./llama-server -m path_to_model` to start the server with the specified model
+- Run `python3 optillm.py --base_url base_url` to start the proxy
+  - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
+
+> [!WARNING]
+> Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
+> `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, and `z3`.
+> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama) or [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
+
+> [!NOTE]
+> You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
+
 ## Usage
 
 Once the proxy is running, you can use it as a drop in replacement for an OpenAI client by setting the `base_url` as `http://localhost:8000/v1`.

From e703e04bc62b9de51796959dece2f7e3e3eb75bb Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 18:52:02 +0200
Subject: [PATCH 2/3] Updated llama-server command for a larger context length

Added the `-c` parameter to the `llama-server` command to increase the context length to 4096 tokens, from the default of 2048 tokens.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9d764b4..903384e 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ python optillm.py
 
 - Set the `OPENAI_API_KEY` env variable to a placeholder value
   - e.g. `export OPENAI_API_KEY="no_key"`
-- Run `./llama-server -m path_to_model` to start the server with the specified model
+- Run `./llama-server -c 4096 -m path_to_model` to start the server with the specified model and a context length of 4096 tokens
 - Run `python3 optillm.py --base_url base_url` to start the proxy
   - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
 

From f7ad745089139416723596eb362e36134d5c86f7 Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 20:32:51 +0200
Subject: [PATCH 3/3] Removed llama-cpp-python as it still doesn't support sampling multiple responses
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 903384e..1f97e2e 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ python optillm.py
 > [!WARNING]
 > Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
 > `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, and `z3`.
-> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama) or [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
+> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama).
 
 > [!NOTE]
 > You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
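For reference, a minimal sketch of the OpenAI client configuration that the NOTE in these patches refers to, assuming the optillm proxy is reachable at its documented address `http://localhost:8000/v1` (see the README's Usage section); the model name `my-local-model` is a hypothetical placeholder, not something defined by the patches:

```python
# Sketch only: pointing the OpenAI client at the optillm proxy described above.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "no_key"),  # placeholder key, as set earlier
    base_url="http://localhost:8000/v1",  # the optillm proxy, not llama-server itself
)

response = client.chat.completions.create(
    model="my-local-model",  # arbitrary name: llama-server was started with a single model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Since llama-server serves only the one model it was started with, any value passed to `model` is accepted as described in the NOTE; the API key is just the placeholder exported earlier.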