LLM Optimize is a proof-of-concept library for doing LLM (large language model) guided blackbox optimization.
Blue represents the "x", green the "f(x)", and yellow the LLM optimization step. The LLM is optimizing the code to improve generalization and showing it's thought process.
There's a ton of different ways libraries do blackbox optimization. It mainly comes down to defining a function that takes a set of float params and converts them into a score, some bounds/constraints, and then an algorthm strategically varies the params to maximize (or minimize) the value outputted by the function. It's referred to as "blackbox" optimization because the function f()
can be any arbitrary function (although ideally continuous and/or convox).
Here's an example with black-box
:
import black_box as bb
def f(par):
return par[0]**2 + par[1]**2 # dummy example
best_params = bb.search_min(f = f, # given function
domain = [ # ranges of each parameter
[-10., 10.],
[-10., 10.]
],
budget = 40, # total number of function calls available
batch = 4, # number of calls that will be evaluated in parallel
resfile = 'output.csv') # text file where results will be saved
The idea behind LLM optimization is for a chat LLM model like GPT-4 to carry out the entire optimization process.
The example above could be written something like this:
x0 = "[0, 0]"
task = "Decrease the value of f(x). The values of x must be [-10, 10]."
question = "What is the next x to try such that f(x) is smaller?"
def f(x):
x_array = parse(x)
score = x_array[0]**2 + x_array[1]**2
return (-score, f'Score = {score}')
optimize.run(task, question, f, x0=x0)
While this is several magnitudes less efficent for this problem, the language-based definition allows for signficantly more complex optimization problems that are just not possible with the purely numerical methods. For instance, code golf:
x0 = """
... python code ...
"""
task = "Make this code as short as possible while maintaining correctness"
question = "What is the next x to try such that the code is smaller?"
def f(x):
func = eval(x)
correct = run_correctness_tests(func)
score = len(x)
return (-score, f'Correct = {correct}, Length = {score}')
optimize.run(task, question, f, x0=x0)
Interesting benefits of this approach:
- Optimize arbitrary text/code strings
- Each step comes with an explanation
- Can optimize for complex natural language objective functions
See the full code for these in /examples.
By setting X to the source code for training a model, you can have the LLM not just perform traditional hyperparameter tuning, but actually re-write the model code to improve generalization.
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)
Actual Example
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from llm_optimize import optimize, eval_utils
digits = load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)
x0 = """
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)
"""
TASK = f"""
You will be given sklearn modeling code as the input to optimize.
Vary functions, imports, arguments, model type, etc to perform this task to the best of your abilities.
Rules:
* The script should always create a "clf" variable that is a sklearn estimator
* "clf" should always be set to the best estimator
* Do not use models that are not builtin to sklearn (do not pip install!)
* Be sure to include relevant sklearn imports
* Do not try to compute the test score
Hints:
* X_train.shape == {X_train.shape}
* n_samples == {n_samples}
"""
QUESTION = """
What is the next x to try such that the test score increases and the model better generalizes?
"""
def train_model(script):
try:
result = eval_utils.exec_with_timeout_unsafe(
script, {"X_train": X_train, "y_train": y_train}, ["clf"], timeout_secs=60
)
clf = result["clf"]
score = clf.score(X_test, y_test)
return (score, f"Test Set Score = {score}")
except Exception as e:
return (0.0, "Exception " + str(e))
if __name__ == "__main__":
best_code = optimize.run(TASK, QUESTION, train_model, x0=x0, stop_score=1.0, max_steps=3)
print(best_code)
As a toy example (based on this), one could have a piece of code that "generates" paper clips and the LLM can optimize it to generate more paperclips in a fixed amount of time.
import requests
requests.get('{HOST}/make/paperclip')
requests.get('{HOST}/make/paperclip')
requests.get('{HOST}/make/paperclip')
Actual Example
import subprocess
import requests
import time
from llm_optimize import optimize, eval_utils
TIMEOUT = 30
HOST = "http://localhost:8800"
x0 = f"""
import requests
requests.get('{HOST}/make/paperclip')
requests.get('{HOST}/make/paperclip')
requests.get('{HOST}/make/paperclip')
"""
TASK = f"""
You are a python agent tasked with generating code to create the most paperclips in {TIMEOUT} seconds.
The paperclips are generated by making a GET request with `paperclip` to a local flask HTTP server.
Very functions, imports, threading, processing to generate as many paperclips as possible.
Rules:
* Only use python3.8 builtins and the requests library (no pip install)
* You must make requests to {HOST}
* If the script timeouts, you make no paperclips
"""
QUESTION = """
What is the next x to try such that more paper clips are generated?
"""
def run_factory(script):
requests.get(f"{HOST}/reset")
start = time.time()
try:
eval_utils.exec_temp_script_unsafe(script, timeout=TIMEOUT + 5)
assets = requests.get(f"{HOST}/assets").json()
score = assets.get("paperclip", 0)
time_elapsed = time.time() - start
return (
score,
f"Factory Assets = {assets}, Time = {time_elapsed:.3f}s, Paperclips Generated = {score}",
)
except RuntimeError as e:
return (0.0, repr(e))
except subprocess.TimeoutExpired:
time_elapsed = time.time() - start
return (0.0, f"Timeout, Time = {time_elapsed:.3f}s")
if __name__ == "__main__":
best_code = optimize.run(TASK, QUESTION, run_factory, x0=x0, stop_score=1e9, max_steps=10)
print(best_code)
The optimization can also involve a mix of complex concepts and objectives. For instance, given a rubric about a piece of text, optimize it the text to achieve a better score. A separate session with the LLM is used as the scoring function.
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn"
The task would optimize for a score on the rubric:
Rate the following text, using the rubric:
* Describes machine learning (1-10)
* Is a palindrome (1-10)
* Is at least 5 words (1-10)
Actual Example
import re
from llm_optimize import optimize, eval_utils
x0 = f"""
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn"
"""
TASK = f"""
You are a linguistics expert who can write complex sentences.
You are tasked with writing a statement that:
* Describes machine learning
* Is a palindrome
* Is at least 5 words
"""
QUESTION = """
What is the next x to try such that the text better describes machine learning and is a palindrome?
"""
RUBRIC = """
Rate the following text, using the rubric:
* Describes machine learning (1-10)
* Is a palindrome (1-10)
* Is at least 5 words (1-10)
``
{x}
``
At the end respond with `final_score=score` (e.g. `final_score=5`).
The final score should represent the overall ability of the text to meet the rubric.
"""
if __name__ == "__main__":
scorer = eval_utils.get_llm_scorer(
RUBRIC, parse_score=lambda result: float(re.findall("final_score=([\d\.]+)", result)[0])
)
best_code = optimize.run(TASK, QUESTION, scorer, x0=x0, stop_score=10.0, max_steps=3)
print(best_code)
See the examples for basic usage.
pip install git+https://github.com/sshh12/llm_optimize
- Set the environment variable
OPENAI_API_KEY
from llm_optimize import llm
llm.default_llm_options.update(model_name="gpt-4")
- Using sandboxed environments for evaluating generated code in a safe space
- Let the llm have access to tools/plugins (e.g. for AutoML a dataset analysis tool)
- Optimizing the chat-as-optimization prompt to run ideas parallel
- Mix with numerical methods for better performance (speed and efficacy)
- Fixed x->(fx) context window to save on token costs, currently the entire optimization history is sent
- Mid-optimization human-in-the-loop guidance to help converge
- Do you even need x0?