LLM Optimize is a proof-of-concept library for doing LLM (large language model) guided blackbox optimization.



autoML example

Blue represents the "x", green the "f(x)", and yellow the LLM optimization step. The LLM is optimizing the code to improve generalization and showing its thought process.


Traditional Optimization

There are many ways libraries do blackbox optimization. It mainly comes down to defining a function that takes a set of float params and converts them into a score, along with some bounds/constraints; an algorithm then strategically varies the params to maximize (or minimize) the value output by the function. It's referred to as "blackbox" optimization because the function f() can be any arbitrary function (although ideally continuous and/or convex).

Here's an example with black-box:

import black_box as bb

def f(par):
    return par[0]**2 + par[1]**2  # dummy example

best_params = bb.search_min(f = f,  # given function
                            domain = [  # ranges of each parameter
                                [-10., 10.],
                                [-10., 10.]
                            ],
                            budget = 40,  # total number of function calls available
                            batch = 4,  # number of calls that will be evaluated in parallel
                            resfile = 'output.csv')  # text file where results will be saved
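
To make "strategically varies the params" concrete, here is a minimal sketch of the simplest possible strategy, uniform random search. It is not how black-box works internally; the function name `random_search_min` and its `seed` parameter are made up for illustration.

```python
import random

def random_search_min(f, domain, budget, seed=0):
    """Return (best_params, best_value) after `budget` uniform random samples."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(budget):
        # Draw one candidate uniformly inside the box constraints.
        params = [rng.uniform(lo, hi) for lo, hi in domain]
        value = f(params)  # f is a black box: only its output is observed
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

def f(par):
    return par[0] ** 2 + par[1] ** 2  # same dummy example as above

domain = [[-10.0, 10.0], [-10.0, 10.0]]
best_params, best_value = random_search_min(f, domain, budget=40)
```

Real libraries replace the uniform sampling with something smarter (surrogate models, evolutionary strategies, etc.), but the interface stays the same: a function, some bounds, and a budget of calls.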

LLM-guided Optimization

The idea behind LLM optimization is for a chat LLM such as GPT-4 to carry out the entire optimization process.

The example above could be written something like this:

from llm_optimize import optimize

x0 = "[0, 0]"

task = "Decrease the value of f(x). The values of x must be [-10, 10]."
question = "What is the next x to try such that f(x) is smaller?"

def f(x):
   x_array = parse(x)  # parse() converts the string x into a list of numbers
   score = x_array[0]**2 + x_array[1]**2
   return (-score, f'Score = {score}')

optimize.run(task, question, f, x0=x0)
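
The `parse` helper is not defined in the snippet above. One way it could look, as a sketch only: it assumes x is a Python-style list literal, and the choice to return a score of negative infinity for unparseable output is an assumption, not library behavior.

```python
import ast

def parse(x):
    """Turn a string such as "[0, 0]" into a list of floats."""
    value = ast.literal_eval(x.strip())  # safe: only evaluates literals
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list of numbers, got {value!r}")
    return [float(v) for v in value]

def f(x):
    try:
        x_array = parse(x)
    except (ValueError, SyntaxError, TypeError) as e:
        # Assumption: malformed LLM output gets the worst score plus a readable reason.
        return (float("-inf"), f"Could not parse x: {e}")
    score = x_array[0] ** 2 + x_array[1] ** 2
    return (-score, f"Score = {score}")
```

Returning the parse error as feedback text gives the LLM a chance to correct its output format on the next step.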

While this is several orders of magnitude less efficient for this problem, the language-based definition allows for significantly more complex optimization problems that are just not possible with purely numerical methods. For instance, code golf:

x0 = """
... python code ...
"""

task = "Make this code as short as possible while maintaining correctness"
question = "What is the next x to try such that the code is smaller?"

def f(x):
   func = eval(x)  # unsafe: evaluates LLM-written code
   correct = run_correctness_tests(func)
   score = len(x)
   return (-score, f'Correct = {correct}, Length = {score}')

optimize.run(task, question, f, x0=x0)
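
A more filled-in variant of the code golf scorer might look like the sketch below. Everything specific here is an assumption for illustration: the convention that the candidate defines a function named `solve`, the list-summing test cases, and the decision to give incorrect programs the worst possible score (the snippet above only reports correctness in the feedback text).

```python
def run_correctness_tests(func):
    """Hypothetical test suite: the target function should sum a list."""
    cases = [([], 0), ([1, 2, 3], 6), ([-1, 1], 0)]
    try:
        return all(func(list(arg)) == expected for arg, expected in cases)
    except Exception:
        return False

def f(x):
    namespace = {}
    try:
        exec(x, namespace)  # unsafe: runs LLM-written code in this process
        correct = run_correctness_tests(namespace["solve"])
    except Exception:
        correct = False  # syntax errors, missing `solve`, crashes, ...
    length = len(x)
    # Assumption: a short but wrong program should never beat a correct one.
    score = -length if correct else float("-inf")
    return (score, f"Correct = {correct}, Length = {length}")
```

For example, `f("solve=sum")` scores -9, while a candidate that fails the tests scores negative infinity regardless of its length.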

Interesting benefits of this approach:

  • Optimize arbitrary text/code strings
  • Each step comes with an explanation
  • Can optimize for complex natural language objective functions
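
Conceptually, the loop behind this approach can be sketched as follows. This is not the library's implementation: `optimize_sketch` and the `propose` callable are hypothetical names, and in real use `propose` would format the task, question, and history into a chat prompt and return the LLM's next x. A deterministic stand-in is used here so the sketch runs without an API key.

```python
import ast

def optimize_sketch(task, question, f, x0, propose, max_steps=10, stop_score=None):
    """Score x0, then repeatedly ask `propose` for a new x and score it."""
    score, feedback = f(x0)
    history = [(x0, score, feedback)]  # everything the proposer gets to see
    best_x, best_score = x0, score
    for _ in range(max_steps):
        if stop_score is not None and best_score >= stop_score:
            break
        x = propose(task, question, history)
        score, feedback = f(x)
        history.append((x, score, feedback))
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

def f(x):
    a, b = ast.literal_eval(x)
    score = a ** 2 + b ** 2
    return (-score, f"Score = {score}")

def halve_proposer(task, question, history):
    # Stand-in for the LLM: halve every coordinate of the most recent x.
    last_x = history[-1][0]
    return str([v / 2 for v in ast.literal_eval(last_x)])

best_x, best_score = optimize_sketch(
    "Decrease the value of f(x).", "What is the next x to try?",
    f, "[8, -4]", halve_proposer, max_steps=3,
)
```

Because both x and the feedback are plain strings, the same loop works unchanged whether x is a list of numbers, a script, or a paragraph of text.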


See the full code for these in /examples.


By setting x to the source code for training a model, you can have the LLM not just perform traditional hyperparameter tuning, but actually rewrite the model code to improve generalization.

from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)

Actual Example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

from llm_optimize import optimize, eval_utils

digits = load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)

x0 = """
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
"""

TASK = f"""
You will be given sklearn modeling code as the input to optimize.

Vary functions, imports, arguments, model type, etc to perform this task to the best of your abilities.

* The script should always create a "clf" variable that is a sklearn estimator
* "clf" should always be set to the best estimator
* Do not use models that are not builtin to sklearn (do not pip install!)
* Be sure to include relevant sklearn imports
* Do not try to compute the test score

* X_train.shape == {X_train.shape}
* n_samples == {n_samples}
"""

QUESTION = """
What is the next x to try such that the test score increases and the model better generalizes?
"""

def train_model(script):
    try:
        # Execute the generated script (unsafe) and pull out its "clf" variable
        result = eval_utils.exec_with_timeout_unsafe(
            script, {"X_train": X_train, "y_train": y_train}, ["clf"], timeout_secs=60
        )
        clf = result["clf"]
        score = clf.score(X_test, y_test)
        return (score, f"Test Set Score = {score}")
    except Exception as e:
        return (0.0, "Exception " + str(e))

if __name__ == "__main__":
    best_code = optimize.run(TASK, QUESTION, train_model, x0=x0, stop_score=1.0, max_steps=3)


As a toy example (based on this), one could have a piece of code that "generates" paperclips, and the LLM can optimize it to generate more paperclips in a fixed amount of time.

import requests

Actual Example
import subprocess
import requests
import time

from llm_optimize import optimize, eval_utils

HOST = "http://localhost:8800"

x0 = f"""
import requests

"""

TASK = f"""
You are a python agent tasked with generating code to create the most paperclips in {TIMEOUT} seconds.

The paperclips are generated by making a GET request with `paperclip` to a local flask HTTP server.

Vary functions, imports, threading, processing to generate as many paperclips as possible.

* Only use python3.8 builtins and the requests library (no pip install)
* You must make requests to {HOST}
* If the script times out, you make no paperclips
"""

QUESTION = """
What is the next x to try such that more paper clips are generated?
"""

def run_factory(script):
    start = time.time()
    try:
        # Run the generated script (unsafe: executes LLM-written code)
        eval_utils.exec_temp_script_unsafe(script, timeout=TIMEOUT + 5)
        assets = requests.get(f"{HOST}/assets").json()
        score = assets.get("paperclip", 0)
        time_elapsed = time.time() - start
        return (
            score,
            f"Factory Assets = {assets}, Time = {time_elapsed:.3f}s, Paperclips Generated = {score}",
        )
    except RuntimeError as e:
        return (0.0, repr(e))
    except subprocess.TimeoutExpired:
        time_elapsed = time.time() - start
        return (0.0, f"Timeout, Time = {time_elapsed:.3f}s")

if __name__ == "__main__":
    best_code = optimize.run(TASK, QUESTION, run_factory, x0=x0, stop_score=1e9, max_steps=10)

Text Rubric

The optimization can also involve a mix of complex concepts and objectives. For instance, given a rubric about a piece of text, optimize the text to achieve a better score. A separate session with the LLM is used as the scoring function.

Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn"

The task would optimize for a score on the rubric:

Rate the following text, using the rubric:
* Describes machine learning (1-10)
* Is a palindrome (1-10)
* Is at least 5 words (1-10)

Actual Example
import re

from llm_optimize import optimize, eval_utils

x0 = f"""
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn"
"""

TASK = f"""
You are a linguistics expert who can write complex sentences.

You are tasked with writing a statement that:
* Describes machine learning
* Is a palindrome
* Is at least 5 words
"""

QUESTION = """
What is the next x to try such that the text better describes machine learning and is a palindrome?
"""

RUBRIC = """
Rate the following text, using the rubric:
* Describes machine learning (1-10)
* Is a palindrome (1-10)
* Is at least 5 words (1-10)

At the end respond with `final_score=score` (e.g. `final_score=5`).

The final score should represent the overall ability of the text to meet the rubric.
"""

if __name__ == "__main__":
    scorer = eval_utils.get_llm_scorer(
        RUBRIC, parse_score=lambda result: float(re.findall(r"final_score=([\d\.]+)", result)[0])
    )
    best_code = optimize.run(TASK, QUESTION, scorer, x0=x0, stop_score=10.0, max_steps=3)
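
The inline `parse_score` lambda raises an `IndexError` if the scoring LLM forgets the `final_score=` line, and a `ValueError` if the number is followed by a sentence-ending period (the pattern would capture `8.5.`). A more forgiving parser could look like the sketch below. The default of 0.0 and the choice to take the last match rather than the first are assumptions, not library behavior.

```python
import re

# One or more digits, optionally followed by a decimal part.
_SCORE_RE = re.compile(r"final_score=([0-9]+(?:\.[0-9]+)?)")

def parse_score(result, default=0.0):
    """Extract the rubric score from the scoring LLM's reply."""
    matches = _SCORE_RE.findall(result)
    if not matches:
        return default  # assumption: a reply without a score counts as the worst score
    # Take the last match in case the reply echoes the rubric's own example first.
    return float(matches[-1])
```

It can be passed in place of the lambda, e.g. `parse_score=parse_score`.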


See the examples for basic usage.


  1. pip install git+
  2. Set the environment variable OPENAI_API_KEY

Change Model

from llm_optimize import llm



Future Work

  • Using sandboxed environments for evaluating generated code in a safe space
  • Let the LLM have access to tools/plugins (e.g. for AutoML a dataset analysis tool)
  • Optimizing the chat-as-optimization prompt to run ideas in parallel
  • Mix with numerical methods for better performance (speed and efficacy)
  • Fixed x -> f(x) context window to save on token costs; currently the entire optimization history is sent
  • Mid-optimization human-in-the-loop guidance to help converge
  • Do you even need x0?
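
As a sketch of the fixed context window idea from the list above: keep only the most recent few (x, feedback) pairs when building the prompt. The `render_history` helper and the prompt layout are hypothetical; the library currently sends the full history.

```python
from collections import deque

def render_history(history):
    """Format the retained (x, feedback) pairs as prompt context."""
    return "\n".join(f"x = {x!r}\nf(x) = {feedback}" for x, feedback in history)

# A deque with maxlen silently drops the oldest entry once it is full.
history = deque(maxlen=2)
for step in range(4):
    history.append((f"candidate {step}", f"Score = {step}"))

prompt_context = render_history(history)
```

A practical version would probably also pin the best-scoring entry so it is never evicted, at the cost of one extra slot.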








