This repository contains research on Prompt Learning (PL), a novel approach to optimizing LLM prompts using natural language feedback instead of numerical scores.
As more AI agents are deployed, a critical question emerges: can a reinforcement-learning-style control loop be built purely in prompts? Traditional reinforcement learning relies on numerical rewards and gradient updates, but prompts are the primary interface for steering large language models. Can RL-style refinement be driven by natural language feedback instead?
- English Error Terms: Natural language feedback instead of numerical scores
- Online Prompt Management: Continuous improvement system designed for production
- Instruction Management: Built-in handling of competing, expiring, and human-reviewed instructions
- Single-Example Learning: Powerful changes using individual examples instead of thousands
- Cost Efficiency: One-tenth to one-hundredth the labeled examples of traditional RL
Prompt learning differs from traditional prompt optimization in two major ways:
Instead of numerical scores, prompt learning uses English error terms - feedback provided in plain English text. For example:
"The generated JSON breaks several rules: the top-level key should be 'page', missing required 'updatedAt' field, and section types must use allowed vocabulary."
The English error term gives the optimizer direct, actionable feedback for tuning instructions, solving problems that current score-only prompt optimization techniques cannot.
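Where does such a critique come from in practice? As a minimal sketch (not the repository's evaluator-prompt-*.txt templates), an LLM judge can be asked to return a plain-English explanation instead of a score; the judge prompt and model name below are assumptions:

```python
# Minimal LLM-as-a-judge sketch that emits an English error term instead of a
# numeric score. The judge prompt and model name are illustrative assumptions,
# not the repository's evaluator templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are reviewing generated webpage JSON against the task rules.\n"
    "Do not output a score. Explain in plain English every rule the output\n"
    "breaks, or reply 'correct' if none are broken.\n\n"
    "Task: {task}\n\nGenerated output:\n{output}"
)

def english_error_term(task: str, output: str) -> str:
    """Return a plain-English critique of one generated output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, output=output)}],
    )
    return response.choices[0].message.content
```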
Prompt learning is designed to run continually against your prompt, tuning instructions back into the context. LLM-based systems can assist with context engineering based on English error terms from evaluation explanations or human annotations.
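One way to picture "tuning instructions back into the context" is to keep the tunable instructions in a clearly delimited region of the prompt, so that only that region is ever rewritten. The delimiters and helper below are illustrative assumptions, not the repository's prompt format:

```python
# Illustrative only: a prompt whose instruction block lives in a fenced region
# that the optimizer alone rewrites, leaving the rest of the prompt untouched.
import re

PROMPT_TEMPLATE = """You are an expert in JSON webpage creation.

<INSTRUCTIONS>
- The top-level key must be 'page'.
</INSTRUCTIONS>

Generate: {input}"""

def replace_instruction_block(prompt: str, new_instructions: str) -> str:
    """Swap the fenced instruction region for the newly tuned instructions."""
    return re.sub(
        r"<INSTRUCTIONS>.*?</INSTRUCTIONS>",
        lambda _: f"<INSTRUCTIONS>\n{new_instructions}\n</INSTRUCTIONS>",
        prompt,
        flags=re.DOTALL,
    )
```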
| Aspect | Prompt Learning | Reinforcement Learning | Prompt Optimization |
|---|---|---|---|
| Feedback Mechanism | Evaluation-based English explanations and human annotations | Numeric rewards | Numeric scores |
| Optimization | Metaprompt defines optimization approach | Updating model based on gradients | Varied but some support metaprompts |
| Prompt Control | Can optimize only specific section of prompt (instruction section) | N/A | Typically optimizes whole prompt |
| Online Setup | Designed to be used always on, with human control of "prompt change" acceptance | Designed to be used online | Normally one-off |
In many real-world use cases, a single optimization run with one-shot output works well. For cases requiring multiple loops (a sketch of the loop follows this list):
- English Explanation (Critique): Generated by evaluator to explain why an example failed
- Feedback Loop: Results used to improve the prompt
- Iterative Improvement: As more instructions are added to fix the prompt, the iterative loop becomes more important
Key Insight: In cases where only 1-10 instructions needed to be added, a single meta-prompt improvement loop was sufficient.
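As a sketch of that outer loop, the following reuses the MetaPromptOptimizer API shown in the Quick Start below; evaluate_failures is a hypothetical callable supplied by the caller that returns a DataFrame of failed training examples with 'input', 'output', and 'feedback' columns:

```python
# Sketch of the iterative improvement loop: each pass re-evaluates the current
# prompt and folds the resulting English critiques back into it.
# MetaPromptOptimizer is the repository class used in the Quick Start;
# evaluate_failures is a hypothetical callable provided by the caller.
from meta_prompt_optimizer import MetaPromptOptimizer

def run_improvement_loops(prompt, train_df, evaluate_failures, num_loops=5, model="gpt-4"):
    for _ in range(num_loops):
        failures = evaluate_failures(prompt, train_df)  # DataFrame: input, output, feedback
        if failures.empty:
            break  # all latent rules satisfied; one loop was enough
        optimizer = MetaPromptOptimizer(prompt=prompt, model_choice=model)
        prompt = optimizer.optimize(
            dataset=failures,
            output_column="output",
            feedback_columns=["feedback"],
        )
    return prompt
```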
We chose a JSON generation problem in which models had to generate JSON for a webpage from a natural language prompt. We generated latent rules that responses needed to follow, for example (a sketch of these checks follows the list):
- Every section needs a type value from a predefined list
- All images must include alt text
- All external asset links must use https
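In the repository these rules are checked with LLM prompts (the rule-checker-prompt-*.txt templates); purely as an illustration, the three example rules above could be checked deterministically over an assumed page-JSON shape:

```python
# Illustrative deterministic check of the three example rules, assuming a JSON
# shape like {"page": {"sections": [{"type": ..., "images": [...], "assets": [...]}]}}.
# The repository itself checks rules with LLM prompts, not this code.
ALLOWED_SECTION_TYPES = {"hero", "features", "testimonials", "footer"}  # assumed vocabulary

def check_rules(page_json: dict) -> list[str]:
    """Return one plain-English violation (an English error term) per broken rule."""
    violations = []
    for i, section in enumerate(page_json.get("page", {}).get("sections", [])):
        if section.get("type") not in ALLOWED_SECTION_TYPES:
            violations.append(f"Section {i}: type '{section.get('type')}' is not in the allowed vocabulary.")
        for image in section.get("images", []):
            if not image.get("alt"):
                violations.append(f"Section {i}: image is missing alt text.")
        for asset in section.get("assets", []):
            if not str(asset.get("url", "")).startswith("https://"):
                violations.append(f"Section {i}: external asset link must use https.")
    return violations
```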
- Mimics Real-World: Designed to mimic typical evaluation cycle of an agent
- Mixed Evaluation: LLM-as-a-judge techniques with human review
- Latent Rules: Rules were implicitly represented in feedback and explanations
- Optimization: Used modified version of meta-prompting dubbed "prompt learning"
| Ruleset Size | Initial Accuracy | Baseline Accuracy w/ Ruleset | Test Accuracy - 1 loop | Test Accuracy - 5 loops |
|---|---|---|---|---|
| 10 | 0% | 41% | 84% | 100% |
| 50 | 0% | 14% | 66% | 82% |
| 100 | 0% | 6% | 0% | 67% |
- Uncovers Latent Rules: Prompt learning can uncover and address the majority of latent rules within the 5-25 ruleset range
- Performance Stability: As more rules are introduced, performance does not drop significantly
- Dramatic Improvement: Prompt learning drastically outperforms un-optimized cases (near-zero baseline accuracy)
- Cost Efficiency: Achieves same level of improvements as reinforcement learning with one-tenth to one-hundredth the number of labeled examples
```
Prompt-Learning/
├── data/                              # Datasets and evaluation data
│   ├── queries.csv                    # Main dataset
│   └── README.md                      # Data documentation
├── prompts/                           # Prompt templates for different rule counts
│   ├── evaluator-prompt-10.txt
│   ├── evaluator-prompt-50.txt
│   ├── evaluator-prompt-100.txt
│   ├── rule-checker-prompt-10.txt
│   ├── rule-checker-prompt-50.txt
│   ├── rule-checker-prompt-100.txt
│   ├── metrics-prompt-10.txt
│   ├── metrics-prompt-50.txt
│   └── metrics-prompt-100.txt
├── notebooks/                         # Jupyter notebooks for experiments
│   └── prompt_learning_cookbook_AX.ipynb
├── meta_prompt.py                     # Core meta-prompt implementation
├── meta_prompt_optimizer.py           # Meta-prompt optimizer
├── prompt_learning_run.py             # Main experiment runner
├── tiktoken_splitter.py               # Token counting utilities
├── train.csv                          # Training dataset
├── test.csv                           # Test dataset
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
```bash
pip install -r requirements.txt
export OPENAI_API_KEY="your-api-key-here"
```

```python
import pandas as pd
from meta_prompt_optimizer import MetaPromptOptimizer

# Create dataset with English feedback
dataset = pd.DataFrame({
    'input': ["Generate a tech company's career page"],
    'output': ["{incorrect JSON output}"],
    'feedback': ["The generated JSON breaks several rules: missing 'updatedAt' field, top-level key should be 'page'"]
})

# Initialize optimizer
optimizer = MetaPromptOptimizer(
    prompt="You are an expert in JSON webpage creation. Generate: {input}",
    model_choice="gpt-4"
)

# Optimize the prompt using English feedback
optimized_prompt = optimizer.optimize(
    dataset=dataset,
    output_column='output',
    feedback_columns=['feedback']
)
```

```bash
# Single experiment
python prompt_learning_run.py

# Multi-rule experiments (10, 50, 100 rules)
# Edit RUN_MULTI_RULE_EXPERIMENTS = True in prompt_learning_run.py
```

- Natural language feedback instead of numerical scores
- Direct integration of explanations into prompt optimization
- Solves problems unsolvable by pure prompt optimization
- Continuous improvement system
- Human control of prompt change acceptance
- Built-in instruction lifecycle management
- Powerful changes using individual examples
- Information-rich feedback from explanations
- Leverages existing eval traces or human annotations
- Handles competing instructions
- Supports instruction expiration
- Enables human review gates
- Deduplication and versioning
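A minimal sketch of the instruction record such lifecycle management implies; the field names and dedup rule are assumptions, not the repository's implementation:

```python
# Hypothetical instruction record carrying the lifecycle metadata listed above:
# versioning, expiration, human-review status, plus a simple dedup step.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Instruction:
    text: str
    version: int = 1
    source: str = "evaluator-critique"    # or "human-annotation"
    approved: bool = False                # human review gate
    expires_at: datetime | None = None    # None means the instruction never expires

    def is_active(self) -> bool:
        return self.approved and (
            self.expires_at is None or self.expires_at > datetime.now(timezone.utc)
        )

def dedupe(instructions: list[Instruction]) -> list[Instruction]:
    """Keep only the highest-version copy of each instruction text."""
    latest: dict[str, Instruction] = {}
    for inst in instructions:
        key = inst.text.strip().lower()
        if key not in latest or inst.version > latest[key].version:
            latest[key] = inst
    return list(latest.values())
```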
| Dimension | PromptAgent | Prompt Learning (PL) |
|---|---|---|
| Objective | Find single "expert-level" prompt maximizing numeric task score | Continuously maintain production prompt for self-healing |
| Optimizer | MCTS over prompt edits; each node = prompt, each edge = edit | Meta-prompt controller reading English critique |
| Update Granularity | Edits entire task prompt; final prompt frozen | Edits only Instruction section in fenced region |
| Use of Critiques | Generates constructive error feedback but text not kept in final prompt | Primary signal; English critique feeds meta-prompt |
| Conflict Management | None once search ends; manual pruning required | Built-in: deduplication, versioning, expiration |
| Online vs. Offline | Offline: heavy search then deployment | Online: one extra LLM call per failure |
| Data Requirement | Moderate-sized scored dev set | Works with single examples |
| Compute Cost | Front-loaded (search); negligible at inference | Minimal upfront, <1 extra call per optimization |
| Interpretability | Final prompt readable, reasoning hidden in search logs | Full audit trail: every instruction edit in plain English |
This is a research repository. For contributions:
- Create a new branch for your experiment
- Update this README with findings
- Submit a pull request
For questions about the research, contact: pjindal@arize.com