From 564fa63a3ad5d72251f08a2f19580c42cd850b89 Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 18:32:12 +0200
Subject: [PATCH 1/3] Added commands for running the proxy with a local server

Added a section in `Installation` that lists the commands needed to run the proxy with a local server.
---
 README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/README.md b/README.md
index 34272f4..9d764b4 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,22 @@ python optillm.py
 2024-09-06 07:57:14,212 - INFO - Press CTRL+C to quit
 ```
 
+### Starting the optillm proxy for a local server (e.g. llama.cpp)
+
+- Set the `OPENAI_API_KEY` env variable to a placeholder value
+  - e.g. `export OPENAI_API_KEY="no_key"`
+- Run `./llama-server -m path_to_model` to start the server with the specified model
+- Run `python3 optillm.py --base_url base_url` to start the proxy
+  - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
+
+> [!WARNING]
+> Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
+> `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, and `z3`.
+> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama) or [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
+
+> [!NOTE]
+> You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
+
 ## Usage
 
 Once the proxy is running, you can use it as a drop in replacement for an OpenAI client by setting the `base_url` as `http://localhost:8000/v1`.

From e703e04bc62b9de51796959dece2f7e3e3eb75bb Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 18:52:02 +0200
Subject: [PATCH 2/3] Updated llama-server command for a larger context length

Added the `-c` parameter to the `llama-server` command to increase the context length to 4096 tokens, from the default of 2048 tokens.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9d764b4..903384e 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ python optillm.py
 
 - Set the `OPENAI_API_KEY` env variable to a placeholder value
   - e.g. `export OPENAI_API_KEY="no_key"`
-- Run `./llama-server -m path_to_model` to start the server with the specified model
+- Run `./llama-server -c 4096 -m path_to_model` to start the server with the specified model and a context length of 4096 tokens
 - Run `python3 optillm.py --base_url base_url` to start the proxy
   - e.g. for llama.cpp, run `python3 optillm.py --base_url http://localhost:8080/v1`
 

From f7ad745089139416723596eb362e36134d5c86f7 Mon Sep 17 00:00:00 2001
From: Andrea Marano <68614754+LuMarans30@users.noreply.github.com>
Date: Sat, 21 Sep 2024 20:32:51 +0200
Subject: [PATCH 3/3] Removed llama-cpp-python as it still doesn't support sampling multiple responses
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 903384e..1f97e2e 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ python optillm.py
 > [!WARNING]
 > Note that llama-server currently does not support sampling multiple responses from a model, which limits the available approaches to the following:
 > `cot_reflection`, `leap`, `plansearch`, `rstar`, `rto`, `self_consistency`, and `z3`.
-> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama) or [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).
+> In order to use other approaches, consider using an alternative compatible server such as [ollama](https://github.com/ollama/ollama).
 
 > [!NOTE]
 > You'll later need to specify a model name in the OpenAI client configuration. Since llama-server was started with a single model, you can choose any name you want.
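For reference, a minimal sketch of the OpenAI client configuration that the NOTE in these patches refers to, assuming the optillm proxy is reachable at its documented address `http://localhost:8000/v1` (see the README's Usage section); the model name `my-local-model` is a hypothetical placeholder, not something defined by the patches:

```python
# Sketch only: pointing the OpenAI client at the optillm proxy described above.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "no_key"),  # placeholder key, as set earlier
    base_url="http://localhost:8000/v1",  # the optillm proxy, not llama-server itself
)

response = client.chat.completions.create(
    model="my-local-model",  # arbitrary name: llama-server was started with a single model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Since llama-server serves only the one model it was started with, any value passed to `model` is accepted as described in the NOTE; the API key is just the placeholder exported earlier.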