
Prompt Learning: Using English Feedback to Optimize LLM Systems

This repository contains research on Prompt Learning (PL), a novel approach to optimizing LLM prompts using natural language feedback instead of numerical scores.

Research Overview

Problem Statement

As more AI agents are deployed, a critical question emerges: can reinforcement-learning-style control systems be built purely in prompts? Traditional reinforcement learning relies on numerical scores and gradient updates, but prompts are the primary interface for guiding large language models. Can we apply RL-style methods to prompt refinement using natural language feedback?

Key Contributions

  • English Error Terms: Natural language feedback instead of numerical scores
  • Online Prompt Management: Continuous improvement system designed for production
  • Instruction Management: Built-in handling of competing, expiring, and human-reviewed instructions
  • Single-Example Learning: Powerful changes using individual examples instead of thousands
  • Cost Efficiency: One-tenth to one-hundredth the labeled examples of traditional RL

What Is Prompt Learning?

Prompt learning differs from traditional prompt optimization in two major ways:

1. English Error Terms

Instead of numerical scores, prompt learning uses English error terms - feedback provided in plain English text. For example:

"The generated JSON breaks several rules: the top-level key should be 'page', missing required 'updatedAt' field, and section types must use allowed vocabulary."

The English error term provides direct, actionable feedback for tuning instructions, solving problems that current score-only prompt optimization techniques cannot.
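To make this concrete, below is a minimal sketch of how one evaluated example carrying an English error term might be represented before it reaches the optimizer. The field names are illustrative, not the repository's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One evaluated example: the raw output plus an English critique."""
    input: str        # the request sent to the generator
    output: str       # the JSON the model produced
    passed: bool      # whether the output satisfied the evaluator
    explanation: str  # the English error term used for optimization

# A failing example whose critique names the violated rules in plain English.
result = EvalResult(
    input="Generate a tech company's career page",
    output='{"careers": {}}',
    passed=False,
    explanation=(
        "The generated JSON breaks several rules: the top-level key should be "
        "'page', the required 'updatedAt' field is missing, and section types "
        "must come from the allowed vocabulary."
    ),
)
```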

2. Online Approach

Prompt learning is designed to run continually against your prompt, folding improved instructions back into the context. LLM-based systems can assist with this context engineering, driven by English error terms from evaluation explanations or human annotations.
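A rough sketch of what an always-on loop could look like in practice is shown below. All of the callables passed in are placeholders rather than functions defined in this repository: `generate` runs the task LLM, `evaluate` returns a pass flag plus an English critique, `propose_update` is the meta-prompt optimizer, and `approve` is the human review gate.

```python
def online_prompt_learning(prompt, traffic, generate, evaluate, propose_update, approve):
    """Always-on loop: fold English critiques back into the prompt after each failure."""
    for example in traffic:                       # production or replayed traffic
        output = generate(prompt, example)
        passed, critique = evaluate(example, output)
        if passed:
            continue                              # only failures trigger an extra LLM call
        candidate = propose_update(prompt, example, output, critique)
        if approve(candidate):                    # human control of "prompt change" acceptance
            prompt = candidate
    return prompt
```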

Key Differences

| Aspect | Prompt Learning | Reinforcement Learning | Prompt Optimization |
|---|---|---|---|
| Feedback Mechanism | Evaluation-based English explanations and human annotations | Numeric rewards | Numeric scores |
| Optimization | Metaprompt defines optimization approach | Updating model based on gradients | Varied, but some support metaprompts |
| Prompt Control | Can optimize only a specific section of the prompt (instruction section) | N/A | Typically optimizes whole prompt |
| Online Setup | Designed to be always on, with human control of "prompt change" acceptance | Designed to be used online | Normally one-off |

How Does the Optimization Loop Work?

In many real-world use cases, a single optimization run with a single-shot output works well. For cases that require multiple loops, the cycle is:

  1. English Explanation (Critique): Generated by evaluator to explain why an example failed
  2. Feedback Loop: Results used to improve the prompt
  3. Iterative Improvement: As more instructions are added to fix the prompt, the iterative loop becomes more important

Key Insight: In cases where only 1-10 instructions needed to be added, a single meta-prompt improvement loop was sufficient.
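As a sketch, running multiple loops just repeats the critique-and-rewrite cycle until the evaluator stops finding failures or a loop budget is reached. The snippet below reuses the `MetaPromptOptimizer` interface from the Basic Usage section later in this README; the `generate` and `evaluate` helpers are assumptions standing in for the task LLM call and the evaluator.

```python
import pandas as pd
from meta_prompt_optimizer import MetaPromptOptimizer

def run_loops(prompt, examples, generate, evaluate, max_loops=5):
    """Repeat the critique -> meta-prompt -> rewrite cycle up to `max_loops` times.

    `generate` runs the task LLM; `evaluate` returns (passed, english_explanation).
    """
    for _ in range(max_loops):
        rows = []
        for example in examples:
            output = generate(prompt, example)
            passed, explanation = evaluate(example, output)
            if not passed:
                rows.append({"input": example, "output": output, "feedback": explanation})
        if not rows:                      # nothing left to fix: a single loop was enough
            break
        optimizer = MetaPromptOptimizer(prompt=prompt, model_choice="gpt-4")
        prompt = optimizer.optimize(
            dataset=pd.DataFrame(rows),
            output_column="output",
            feedback_columns=["feedback"],
        )
    return prompt
```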

Experimental Design

Test Application: JSON Generation

We chose a JSON generation problem where models had to generate JSON for a webpage based on natural language prompts. We generated latent rules that responses needed to follow:

  • Every section needs a type value from a predefined list
  • All images must include alt text
  • All external asset links must use https
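For illustration, a programmatic check of these three rules might look like the sketch below. The JSON shape and field names are assumptions; the repository's actual rule-checking logic lives in the prompt templates under prompts/.

```python
ALLOWED_SECTION_TYPES = {"hero", "features", "testimonials", "footer"}  # illustrative list

def check_latent_rules(page_json: dict) -> list[str]:
    """Return English explanations for every latent rule the page violates."""
    violations = []
    for section in page_json.get("sections", []):
        if section.get("type") not in ALLOWED_SECTION_TYPES:
            violations.append(
                f"Section type '{section.get('type')}' is not in the allowed vocabulary."
            )
        for image in section.get("images", []):
            if not image.get("alt"):
                violations.append("An image is missing alt text.")
            src = image.get("src", "")
            if src.startswith("http://"):
                violations.append(f"External asset link '{src}' must use https.")
    return violations
```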

Evaluation Setup

  • Mimics Real-World: Designed to mimic the typical evaluation cycle of an agent
  • Mixed Evaluation: LLM-as-a-judge techniques combined with human review
  • Latent Rules: Rules were implicitly represented in feedback and explanations
  • Optimization: Used a modified version of meta-prompting, dubbed "prompt learning"
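A rough sketch of the LLM-as-a-judge step is shown below, assuming an OpenAI-style chat completion call. The evaluator prompt files in prompts/ would supply the actual judging instructions; the PASS/FAIL-plus-explanation output convention here is an assumption.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(evaluator_prompt: str, query: str, generated_json: str) -> tuple[bool, str]:
    """LLM-as-a-judge: return (passed, english_explanation) for one output."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": evaluator_prompt},
            {"role": "user", "content": f"Query:\n{query}\n\nGenerated JSON:\n{generated_json}"},
        ],
    )
    text = response.choices[0].message.content
    # Assumed convention: the judge begins with PASS or FAIL, followed by an
    # explanation that becomes the English error term.
    passed = text.strip().upper().startswith("PASS")
    return passed, text
```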

Performance Results

Table 1: Prompt Learning Performance

| Ruleset Size | Initial Accuracy | Baseline Accuracy w/ Ruleset | Test Accuracy - 1 loop | Test Accuracy - 5 loops |
|---|---|---|---|---|
| 10 | 0% | 41% | 84% | 100% |
| 50 | 0% | 14% | 66% | 82% |
| 100 | 0% | 6% | 0% | 67% |

Key Findings

  1. Uncovers Latent Rules: Prompt learning can uncover and address the majority of latent rules within the 5-25 ruleset range
  2. Performance Stability: As more rules are introduced, performance does not drop significantly
  3. Dramatic Improvement: Prompt learning drastically outperforms un-optimized cases (near-zero baseline accuracy)
  4. Cost Efficiency: Achieves same level of improvements as reinforcement learning with one-tenth to one-hundredth the number of labeled examples

Repository Structure

Prompt-Learning/
├── data/                   # Datasets and evaluation data
│   ├── queries.csv        # Main dataset
│   └── README.md          # Data documentation
├── prompts/               # Prompt templates for different rule counts
│   ├── evaluator-prompt-10.txt
│   ├── evaluator-prompt-50.txt
│   ├── evaluator-prompt-100.txt
│   ├── rule-checker-prompt-10.txt
│   ├── rule-checker-prompt-50.txt
│   ├── rule-checker-prompt-100.txt
│   ├── metrics-prompt-10.txt
│   ├── metrics-prompt-50.txt
│   └── metrics-prompt-100.txt
├── notebooks/             # Jupyter notebooks for experiments      
│   └── prompt_learning_cookbook_AX.ipynb
├── meta_prompt.py         # Core meta-prompt implementation
├── meta_prompt_optimizer.py # Meta-prompt optimizer
├── prompt_learning_run.py # Main experiment runner
├── tiktoken_splitter.py   # Token counting utilities
├── train.csv              # Training dataset
├── test.csv               # Test dataset
├── requirements.txt       # Python dependencies
└── README.md             # This file

Quick Start

Installation

pip install -r requirements.txt

Environment Setup

export OPENAI_API_KEY="your-api-key-here"

Basic Usage

import pandas as pd
from meta_prompt_optimizer import MetaPromptOptimizer

# Create dataset with English feedback
dataset = pd.DataFrame({
    'input': ["Generate a tech company's career page"],
    'output': ["{incorrect JSON output}"],
    'feedback': ["The generated JSON breaks several rules: missing 'updatedAt' field, top-level key should be 'page'"]
})

# Initialize optimizer
optimizer = MetaPromptOptimizer(
    prompt="You are an expert in JSON webpage creation. Generate: {input}",
    model_choice="gpt-4"
)

# Optimize the prompt using English feedback
optimized_prompt = optimizer.optimize(
    dataset=dataset,
    output_column='output',
    feedback_columns=['feedback']
)

Running Experiments

# Single experiment
python prompt_learning_run.py

# Multi-rule experiments (10, 50, 100 rules)
# Edit RUN_MULTI_RULE_EXPERIMENTS = True in prompt_learning_run.py

Key Innovations

1. English Error Terms

  • Natural language feedback instead of numerical scores
  • Direct integration of explanations into prompt optimization
  • Solves problems unsolvable by pure prompt optimization

2. Online Prompt Management

  • Continuous improvement system
  • Human control of prompt change acceptance
  • Built-in instruction lifecycle management

3. Single-Example Learning

  • Powerful changes using individual examples
  • Information-rich feedback from explanations
  • Leverages existing eval traces or human annotations

4. Instruction Management

  • Handles competing instructions
  • Supports instruction expiration
  • Enables human review gates
  • Deduplication and versioning
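One way this instruction lifecycle state could be tracked is sketched below; the record fields and filtering logic are illustrative rather than the repository's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Instruction:
    """One managed instruction inside the prompt's instruction section."""
    text: str                                 # the instruction in plain English
    version: int = 1                          # bumped whenever the text is rewritten
    source_example_id: Optional[str] = None   # the failure that produced it
    expires_at: Optional[datetime] = None     # optional expiration
    approved: bool = False                    # human review gate before use

def active_instructions(instructions: list[Instruction], now: datetime) -> list[Instruction]:
    """Keep only approved, unexpired, deduplicated instructions."""
    seen, kept = set(), []
    for inst in instructions:
        if not inst.approved:
            continue
        if inst.expires_at is not None and inst.expires_at <= now:
            continue
        if inst.text in seen:                 # crude dedup on exact text
            continue
        seen.add(inst.text)
        kept.append(inst)
    return kept
```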

Comparison with Other Approaches

vs. PromptAgent (ICLR '24)

| Dimension | PromptAgent | Prompt Learning (PL) |
|---|---|---|
| Objective | Find single "expert-level" prompt maximizing numeric task score | Continuously maintain production prompt for self-healing |
| Optimizer | MCTS over prompt edits; each node = prompt, each edge = edit | Meta-prompt controller reading English critique |
| Update Granularity | Edits entire task prompt; final prompt frozen | Edits only Instruction section in fenced region |
| Use of Critiques | Generates constructive error feedback but text not kept in final prompt | Primary signal; English critique feeds meta-prompt |
| Conflict Management | None once search ends; manual pruning required | Built-in: deduplication, versioning, expiration |
| Online vs. Offline | Offline: heavy search then deployment | Online: one extra LLM call per failure |
| Data Requirement | Moderate-sized scored dev set | Works with single examples |
| Compute Cost | Front-loaded (search); negligible at inference | Minimal upfront, <1 extra call per optimization |
| Interpretability | Final prompt readable, reasoning hidden in search logs | Full audit trail: every instruction edit in plain English |

Contributing

This is a research repository. For contributions:

  1. Create a new branch for your experiment
  2. Update this README with findings
  3. Submit a pull request

Contact

For questions about the research, contact: pjindal@arize.com

About

Research repository for meta-prompt optimization techniques, developed at Arize AI.
