Character Prefix Conditioning

A minimal, efficient implementation of character prefix conditioning (CPC) for code completion, inspired by the Cursor blog.

Overview

When using a language model for code completion, we typically want the model to produce a completion that begins with what the user has typed. However, modern language models operate on sequences of tokens, not characters, so naively tokenizing the user's input and sending it to the model produces wrong results if the user's cursor doesn't happen to lie on a token boundary.

CPC is an algorithm for sampling a sequence of tokens conditioned on a character prefix, ensuring completions always start with the user's typed prefix—even if it doesn't align with token boundaries.
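
To make the mismatch concrete, the token ids of what the user has typed need not be a prefix of the token ids of the completed text. A small sketch using the Hugging Face GPT-2 tokenizer (illustrative only; whether the check passes depends on exactly where the cut falls):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prefix_ids = tokenizer.encode("import num")        # what the user typed
full_ids = tokenizer.encode("import numpy as np")  # the desired final text

# When the cursor sits mid-token, the prefix tokenization can diverge
# from the full text's tokenization, so "tokenize and continue" breaks.
print(full_ids[:len(prefix_ids)] == prefix_ids)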

Mathematical Foundation

Problem Statement

We want to sample a sequence of tokens $s = t_1, t_2, \ldots, t_n$ from a distribution specified by an autoregressive model $p(s)$ given by:

$$p(s) = p(t_1, t_2, \ldots, t_n) = \prod_{k=1}^{n} p(t_k \mid t_1, \ldots, t_{k-1})$$

subject to the constraint that $s$ starts with a character prefix $P$, i.e., $P$ is a prefix of $\text{repr}(t_1) + \text{repr}(t_2) + \cdots + \text{repr}(t_n)$, where $+$ means string concatenation and $\text{repr}$ maps a token to the characters it represents.

We define $q(s) = p(s \mid s \text{ starts with } P)$. It's sufficient to find a way to sample autoregressively from $q(s)$, that is, to sample from $q(t_k \mid t_1, \ldots, t_{k-1})$ for each $k$.
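
One way to see what each step requires is Bayes' rule (left implicit in the problem statement):

$$q(t_k \mid t_1, \ldots, t_{k-1}) \propto p(t_k \mid t_1, \ldots, t_{k-1}) \cdot \Pr\left(s \text{ starts with } P \mid t_1, \ldots, t_k\right)$$

The second factor is 0 when the characters of $t_1, \ldots, t_k$ are inconsistent with $P$, and 1 once they cover all of $P$; in between, it depends on future tokens. The algorithm below approximates it with a binary indicator.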

Algorithm

For each step $k$, we need to sample from $q(t_k \mid t_1, \ldots, t_{k-1})$. The algorithm below approximates this efficiently by masking incompatible tokens and renormalizing (a code sketch follows the list):

  1. Get model predictions: Compute $p(t_k \mid t_1, \ldots, t_{k-1})$ from the language model for all possible tokens $t_k$

  2. Apply constraint mask: For each candidate token $t_k$, check whether appending it keeps the generated characters consistent with the prefix $P$. Create a binary mask $M(t_k)$ where:

    • $M(t_k) = 1$ if $\text{repr}(t_1) + \cdots + \text{repr}(t_{k-1}) + \text{repr}(t_k)$ is consistent with $P$, i.e., it starts with $P$ or is itself a prefix of $P$
    • $M(t_k) = 0$ otherwise
  3. Renormalize probabilities: Compute the constrained distribution: $$q(t_k) = \frac{p(t_k) \cdot M(t_k)}{\sum_{t'} p(t') \cdot M(t')}$$

  4. Sample from constrained distribution: Sample $t_k \sim q(t_k)$

  5. Drop the constraint once satisfied: Once the concatenated characters of the sampled tokens start with $P$, every continuation satisfies the constraint, so subsequent tokens are sampled unconstrained
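
A minimal sketch of one constrained step, assuming a 1-D tensor of next-token logits and a precomputed token_strs list mapping each token id to its characters (names here are illustrative, not the repo's actual API):

import torch

def constrained_step(logits, token_strs, generated_chars, prefix):
    # Characters of P not yet produced by the tokens sampled so far.
    remaining = prefix[len(generated_chars):]
    if not remaining:
        # Constraint already satisfied: sample unconstrained.
        probs = torch.softmax(logits, dim=-1)
    else:
        # A token is compatible if it consumes part of the remaining
        # prefix, or covers all of it and continues past it.
        mask = torch.tensor([remaining.startswith(t) or t.startswith(remaining)
                             for t in token_strs])
        probs = torch.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
    return int(torch.multinomial(probs, 1))

The Python loop over token_strs stands in for the vectorized vocabulary-wide check described in step 2.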

Key Insights

  • Efficiency: The algorithm requires only one forward pass through the language model per generated token, minimizing model calls.
  • Vectorization: The constraint checking (for all possible next tokens) is vectorized across the vocabulary, making it efficient despite being O(|V|) per step.
  • Early constraint exit: Once the generated characters cover $P$, the mask is trivially all-ones, so constraint checking can be switched off and generation continues as ordinary sampling.
  • Fallback strategies: In edge cases where no token passes the mask, or the valid tokens carry negligible probability, the algorithm can fall back to the most probable valid token, or temporarily relax the constraint and retry (a sketch follows this list).
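
Continuing the constrained_step sketch above, a hedged version of these fallbacks (the threshold is a hypothetical tuning knob):

base_probs = torch.softmax(logits, dim=-1)
valid_mass = (base_probs * mask).sum()

if not mask.any():
    # No vocabulary token is consistent with the remaining prefix;
    # relax the constraint for this step and let a retry handle it.
    next_id = int(torch.argmax(logits))
elif valid_mass < 1e-6:
    # Valid tokens exist but carry negligible probability: take the
    # most probable valid token deterministically instead of sampling.
    next_id = int(torch.argmax(base_probs * mask))
else:
    next_id = int(torch.multinomial(base_probs * mask / valid_mass, 1))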

Complexity Analysis

  • Per step: O(|V|) constraint checking (vectorized), 1 model call
  • Total: O(n · |V|) constraint checks and O(n) model calls for generating n tokens
  • Memory: O(|V|) for storing token representations and masks
  • Optimizations: KV caching avoids recomputing attention over earlier tokens at each step, and dropping the mask once $P$ is covered removes the remaining constraint-checking overhead
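
Since the mask needs only each token's characters, those can be decoded once up front; a sketch assuming a Hugging Face tokenizer (not necessarily how main.py organizes this):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Decode every vocabulary id to its character representation once, so
# the per-step compatibility check is pure string work with no
# tokenizer calls inside the sampling loop.
token_strs = [tokenizer.decode([i]) for i in range(tokenizer.vocab_size)]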

Setup

Install dependencies:

uv sync

Run the main script:

uv run main.py

Usage

from main import ModelManager, character_prefix_sample

# Initialize and load model
model_manager = ModelManager("gpt2")
model_manager.load_model()

# Generate with character prefix constraint
result = character_prefix_sample(
    model_manager=model_manager,
    prompt_text="import",
    character_prefix="import num",
    max_new_tokens=15
)
print(result)  # Output: "import numpy as np"

Examples

The implementation includes test cases demonstrating a range of scenarios, each listed as prompt → character prefix → resulting completion:

  • Simple prefix matching: "import" → "import num" → "import numpy as np"
  • Mid-token completion: "The model's behav" → "The model's behavi" → "The model's behavior"
  • F-string completion: 'print(f"The result is {re' → 'print(f"The result is {res' → 'print(f"The result is {result}"'
  • Empty prompt generation: "" → "Once upon a ti" → "Once upon a time"
  • JSON completion: '{"data": {"user' → '{"data": {"username": "test' → '{"data": {"username": "test"}}'
