This repository contains research on Prompt Learning (PL), a novel approach to optimizing LLM prompts using natural language feedback instead of numerical scores.
As more AI agents are deployed, a critical question emerges: can a reinforcement-learning-style control loop be built purely in prompts? Traditional reinforcement learning relies on numerical rewards and gradient updates, but prompts are the primary interface for steering large language models. Can RL-style refinement be driven by natural language feedback instead?
- English Error Terms: Natural language feedback instead of numerical scores
- Online Prompt Management: Continuous improvement system designed for production
- Instruction Management: Built-in handling of competing, expiring, and human-reviewed instructions
- Single-Example Learning: Powerful changes using individual examples instead of thousands
- Cost Efficiency: One-tenth to one-hundredth the labeled examples of traditional RL
Prompt learning differs from traditional prompt optimization in two major ways:
Instead of numerical scores, prompt learning uses English error terms - feedback provided in plain English text. For example:
"The generated JSON breaks several rules: the top-level key should be 'page', missing required 'updatedAt' field, and section types must use allowed vocabulary."
The English error term gives the optimizer direct, actionable feedback for tuning instructions, solving problems that current score-only prompt optimization techniques cannot.
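Where does such a critique come from in practice? As a minimal sketch (not the repository's evaluator-prompt-*.txt templates), an LLM judge can be asked to return a plain-English explanation instead of a score; the judge prompt and model name below are assumptions:

```python
# Minimal LLM-as-a-judge sketch that emits an English error term instead of a
# numeric score. The judge prompt and model name are illustrative assumptions,
# not the repository's evaluator templates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are reviewing generated webpage JSON against the task rules.\n"
    "Do not output a score. Explain in plain English every rule the output\n"
    "breaks, or reply 'correct' if none are broken.\n\n"
    "Task: {task}\n\nGenerated output:\n{output}"
)

def english_error_term(task: str, output: str) -> str:
    """Return a plain-English critique of one generated output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, output=output)}],
    )
    return response.choices[0].message.content
```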
Prompt learning is designed to run continually against your prompt, tuning instructions back into the context. LLM-based systems can assist with context engineering based on English error terms from evaluation explanations or human annotations.
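One way to picture "tuning instructions back into the context" is to keep the tunable instructions in a clearly delimited region of the prompt, so that only that region is ever rewritten. The delimiters and helper below are illustrative assumptions, not the repository's prompt format:

```python
# Illustrative only: a prompt whose instruction block lives in a fenced region
# that the optimizer alone rewrites, leaving the rest of the prompt untouched.
import re

PROMPT_TEMPLATE = """You are an expert in JSON webpage creation.

<INSTRUCTIONS>
- The top-level key must be 'page'.
</INSTRUCTIONS>

Generate: {input}"""

def replace_instruction_block(prompt: str, new_instructions: str) -> str:
    """Swap the fenced instruction region for the newly tuned instructions."""
    return re.sub(
        r"<INSTRUCTIONS>.*?</INSTRUCTIONS>",
        lambda _: f"<INSTRUCTIONS>\n{new_instructions}\n</INSTRUCTIONS>",
        prompt,
        flags=re.DOTALL,
    )
```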
| Aspect | Prompt Learning | Reinforcement Learning | Prompt Optimization |
|---|---|---|---|
| Feedback Mechanism | Evaluation-based English explanations and human annotations | Numeric rewards | Numeric scores |
| Optimization | Metaprompt defines optimization approach | Updating model based on gradients | Varied but some support metaprompts |
| Prompt Control | Can optimize only specific section of prompt (instruction section) | N/A | Typically optimizes whole prompt |
| Online Setup | Designed to be used always on, with human control of "prompt change" acceptance | Designed to be used online | Normally one-off |
In many real-world use cases, a single optimization run with one-shot output works well. For cases requiring multiple loops (a sketch of the loop follows this list):
- English Explanation (Critique): Generated by evaluator to explain why an example failed
- Feedback Loop: Results used to improve the prompt
- Iterative Improvement: As more instructions are added to fix the prompt, the iterative loop becomes more important
Key Insight: In cases where only 1-10 instructions needed to be added, a single meta-prompt improvement loop was sufficient.
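As a sketch of that outer loop, the following reuses the MetaPromptOptimizer API shown in the Quick Start below; evaluate_failures is a hypothetical callable supplied by the caller that returns a DataFrame of failed training examples with 'input', 'output', and 'feedback' columns:

```python
# Sketch of the iterative improvement loop: each pass re-evaluates the current
# prompt and folds the resulting English critiques back into it.
# MetaPromptOptimizer is the repository class used in the Quick Start;
# evaluate_failures is a hypothetical callable provided by the caller.
from meta_prompt_optimizer import MetaPromptOptimizer

def run_improvement_loops(prompt, train_df, evaluate_failures, num_loops=5, model="gpt-4"):
    for _ in range(num_loops):
        failures = evaluate_failures(prompt, train_df)  # DataFrame: input, output, feedback
        if failures.empty:
            break  # all latent rules satisfied; one loop was enough
        optimizer = MetaPromptOptimizer(prompt=prompt, model_choice=model)
        prompt = optimizer.optimize(
            dataset=failures,
            output_column="output",
            feedback_columns=["feedback"],
        )
    return prompt
```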
We chose a JSON generation problem in which models had to generate JSON for a webpage from a natural language prompt. We generated latent rules that responses needed to follow, for example (a sketch of these checks follows the list):
- Every section needs a type value from a predefined list
- All images must include alt text
- All external asset links must use https
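In the repository these rules are checked with LLM prompts (the rule-checker-prompt-*.txt templates); purely as an illustration, the three example rules above could be checked deterministically over an assumed page-JSON shape:

```python
# Illustrative deterministic check of the three example rules, assuming a JSON
# shape like {"page": {"sections": [{"type": ..., "images": [...], "assets": [...]}]}}.
# The repository itself checks rules with LLM prompts, not this code.
ALLOWED_SECTION_TYPES = {"hero", "features", "testimonials", "footer"}  # assumed vocabulary

def check_rules(page_json: dict) -> list[str]:
    """Return one plain-English violation (an English error term) per broken rule."""
    violations = []
    for i, section in enumerate(page_json.get("page", {}).get("sections", [])):
        if section.get("type") not in ALLOWED_SECTION_TYPES:
            violations.append(f"Section {i}: type '{section.get('type')}' is not in the allowed vocabulary.")
        for image in section.get("images", []):
            if not image.get("alt"):
                violations.append(f"Section {i}: image is missing alt text.")
        for asset in section.get("assets", []):
            if not str(asset.get("url", "")).startswith("https://"):
                violations.append(f"Section {i}: external asset link must use https.")
    return violations
```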
- Mimics Real-World: Designed to mimic typical evaluation cycle of an agent
- Mixed Evaluation: LLM-as-a-judge techniques with human review
- Latent Rules: Rules were implicitly represented in feedback and explanations
- Optimization: Used modified version of meta-prompting dubbed "prompt learning"
| Ruleset Size | Initial Accuracy | Baseline Accuracy w/ Ruleset | Test Accuracy - 1 loop | Test Accuracy - 5 loops |
|---|---|---|---|---|
| 10 | 0% | 41% | 84% | 100% |
| 50 | 0% | 14% | 66% | 82% |
| 100 | 0% | 6% | 0% | 67% |
- Uncovers Latent Rules: Prompt learning can uncover and address the majority of latent rules within the 5-25 ruleset range
- Performance Stability: As more rules are introduced, performance does not drop significantly
- Dramatic Improvement: Prompt learning drastically outperforms un-optimized cases (near-zero baseline accuracy)
- Cost Efficiency: Achieves same level of improvements as reinforcement learning with one-tenth to one-hundredth the number of labeled examples
```
Prompt-Learning/
├── data/                              # Datasets and evaluation data
│   ├── queries.csv                    # Main dataset
│   └── README.md                      # Data documentation
├── prompts/                           # Prompt templates for different rule counts
│   ├── evaluator-prompt-10.txt
│   ├── evaluator-prompt-50.txt
│   ├── evaluator-prompt-100.txt
│   ├── rule-checker-prompt-10.txt
│   ├── rule-checker-prompt-50.txt
│   ├── rule-checker-prompt-100.txt
│   ├── metrics-prompt-10.txt
│   ├── metrics-prompt-50.txt
│   └── metrics-prompt-100.txt
├── notebooks/                         # Jupyter notebooks for experiments
│   └── prompt_learning_cookbook_AX.ipynb
├── meta_prompt.py                     # Core meta-prompt implementation
├── meta_prompt_optimizer.py           # Meta-prompt optimizer
├── prompt_learning_run.py             # Main experiment runner
├── tiktoken_splitter.py               # Token counting utilities
├── train.csv                          # Training dataset
├── test.csv                           # Test dataset
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
```bash
pip install -r requirements.txt
export OPENAI_API_KEY="your-api-key-here"
```

```python
import pandas as pd
from meta_prompt_optimizer import MetaPromptOptimizer

# Create dataset with English feedback
dataset = pd.DataFrame({
    'input': ["Generate a tech company's career page"],
    'output': ["{incorrect JSON output}"],
    'feedback': ["The generated JSON breaks several rules: missing 'updatedAt' field, top-level key should be 'page'"]
})

# Initialize optimizer
optimizer = MetaPromptOptimizer(
    prompt="You are an expert in JSON webpage creation. Generate: {input}",
    model_choice="gpt-4"
)

# Optimize the prompt using English feedback
optimized_prompt = optimizer.optimize(
    dataset=dataset,
    output_column='output',
    feedback_columns=['feedback']
)
```

```bash
# Single experiment
python prompt_learning_run.py

# Multi-rule experiments (10, 50, 100 rules)
# Edit RUN_MULTI_RULE_EXPERIMENTS = True in prompt_learning_run.py
```

- Natural language feedback instead of numerical scores
- Direct integration of explanations into prompt optimization
- Solves problems unsolvable by pure prompt optimization
- Continuous improvement system
- Human control of prompt change acceptance
- Built-in instruction lifecycle management
- Powerful changes using individual examples
- Information-rich feedback from explanations
- Leverages existing eval traces or human annotations
- Handles competing instructions
- Supports instruction expiration
- Enables human review gates
- Deduplication and versioning
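A minimal sketch of the instruction record such lifecycle management implies; the field names and dedup rule are assumptions, not the repository's implementation:

```python
# Hypothetical instruction record carrying the lifecycle metadata listed above:
# versioning, expiration, human-review status, plus a simple dedup step.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Instruction:
    text: str
    version: int = 1
    source: str = "evaluator-critique"    # or "human-annotation"
    approved: bool = False                # human review gate
    expires_at: datetime | None = None    # None means the instruction never expires

    def is_active(self) -> bool:
        return self.approved and (
            self.expires_at is None or self.expires_at > datetime.now(timezone.utc)
        )

def dedupe(instructions: list[Instruction]) -> list[Instruction]:
    """Keep only the highest-version copy of each instruction text."""
    latest: dict[str, Instruction] = {}
    for inst in instructions:
        key = inst.text.strip().lower()
        if key not in latest or inst.version > latest[key].version:
            latest[key] = inst
    return list(latest.values())
```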
| Dimension | PromptAgent | Prompt Learning (PL) |
|---|---|---|
| Objective | Find single "expert-level" prompt maximizing numeric task score | Continuously maintain production prompt for self-healing |
| Optimizer | MCTS over prompt edits; each node = prompt, each edge = edit | Meta-prompt controller reading English critique |
| Update Granularity | Edits entire task prompt; final prompt frozen | Edits only Instruction section in fenced region |
| Use of Critiques | Generates constructive error feedback but text not kept in final prompt | Primary signal; English critique feeds meta-prompt |
| Conflict Management | None once search ends; manual pruning required | Built-in: deduplication, versioning, expiration |
| Online vs. Offline | Offline: heavy search then deployment | Online: one extra LLM call per failure |
| Data Requirement | Moderate-sized scored dev set | Works with single examples |
| Compute Cost | Front-loaded (search); negligible at inference | Minimal upfront, <1 extra call per optimization |
| Interpretability | Final prompt readable, reasoning hidden in search logs | Full audit trail: every instruction edit in plain English |
This is a research repository. For contributions:
- Create a new branch for your experiment
- Update this README with findings
- Submit a pull request
For questions about the research, contact: pjindal@arize.com