A minimal, efficient implementation of character prefix conditioning (CPC) for code completion, inspired by the Cursor blog.
When using a language model for code completion, we typically want the model to produce a completion that begins with what the user has typed. However, modern language models operate on sequences of tokens, not characters, so naively tokenizing the user's input and sending it to the model can produce poor completions when the cursor doesn't lie on a token boundary: the model ends up conditioned on a token split it rarely saw during training.
CPC is an algorithm for sampling a sequence of tokens conditioned on a character prefix, ensuring completions always start with the user's typed prefix—even if it doesn't align with token boundaries.
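To make the mismatch concrete, here is a small illustration using the Hugging Face `transformers` GPT-2 tokenizer (the same model family the usage example below loads); the exact token splits are tokenizer-specific:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

typed = "The model's behav"      # characters at the user's cursor
full = "The model's behavior"    # text we would like the completion to form

print([tok.decode([i]) for i in tok.encode(typed)])
print([tok.decode([i]) for i in tok.encode(full)])
# Whenever the cursor falls inside a token, the two token sequences diverge at
# the cut point, so naively continuing from the typed text's tokens conditions
# the model on a token split it rarely saw during training.
```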
We want to sample a sequence of tokens $t_1, \ldots, t_n$ from an autoregressive language model $p$, subject to the constraint that the decoded string $\text{repr}(t_1) + \cdots + \text{repr}(t_n)$ starts with the user's character prefix $P$. We define $\text{repr}(t)$ as the string of characters that token $t$ decodes to. For each step $k$, the algorithm proceeds as follows (a minimal code sketch of one such step follows the list):

- Get model predictions: Compute $p(t_k \mid t_1, \ldots, t_{k-1})$ from the language model for all possible tokens $t_k$.
- Apply constraint mask: For each token $t_k$, check whether appending it to the current sequence keeps the character prefix constraint $P$ satisfiable. Create a binary mask $M(t_k)$ where:
  - $M(t_k) = 1$ if $\text{repr}(t_1) + \cdots + \text{repr}(t_{k-1}) + \text{repr}(t_k)$ is consistent with $P$ (it starts with $P$, or is itself a prefix of $P$)
  - $M(t_k) = 0$ otherwise
- Renormalize probabilities: Compute the constrained distribution:
  $$q(t_k) = \frac{p(t_k) \cdot M(t_k)}{\sum_{t'} p(t') \cdot M(t')}$$
- Sample from the constrained distribution: Sample $t_k \sim q(t_k)$.
- Terminate when the constraint is satisfied: Once $\text{repr}(t_1) + \cdots + \text{repr}(t_k)$ starts with the prefix $P$, the constrained phase ends and generation continues normally.
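Below is a minimal sketch of one such constrained step, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits`; the function name, the precomputed `vocab_strings` table, and the `remaining_prefix` bookkeeping are illustrative, not the repository's exact API:

```python
import torch

@torch.no_grad()
def constrained_step(model, vocab_strings, input_ids, remaining_prefix):
    """One CPC step: mask inconsistent tokens, renormalize, sample t_k.

    `vocab_strings[i]` is the decoded string of vocabulary id i (precomputed once);
    `remaining_prefix` is the part of the user's prefix not yet covered by
    previously generated tokens ("" once the constraint is satisfied).
    """
    logits = model(input_ids=input_ids).logits[0, -1]   # scores for t_k
    probs = torch.softmax(logits, dim=-1)                # p(t_k | t_1..t_{k-1})

    # M(t_k) = 1 iff the token's characters stay consistent with the prefix:
    # the token string starts with the remaining prefix, or is a prefix of it.
    mask = torch.tensor(
        [s.startswith(remaining_prefix) or remaining_prefix.startswith(s)
         for s in vocab_strings],
        dtype=probs.dtype, device=probs.device,
    )

    q = probs * mask
    if q.sum() == 0:                   # no consistent token: fallback/retry case
        raise RuntimeError("no token consistent with the remaining prefix")
    q = q / q.sum()                    # q(t_k) = p(t_k)*M(t_k) / sum_t' p(t')*M(t')
    next_id = int(torch.multinomial(q, num_samples=1))

    consumed = min(len(remaining_prefix), len(vocab_strings[next_id]))
    return next_id, remaining_prefix[consumed:]
```

Note that when `remaining_prefix` is empty every token passes the mask, so the same step reduces to ordinary sampling; this is the early-termination behavior described below.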
- Efficiency: The algorithm requires only one forward pass through the language model per generated token, minimizing model calls.
- Vectorization: The constraint checking (for all possible next tokens) is vectorized across the vocabulary, making it efficient despite being O(|V|) per step (see the sketch after this list).
- Early termination: Constrained generation can stop once the prefix constraint is satisfied, then continue normally.
- Fallback strategies: For edge cases where no valid token has sufficient probability, the algorithm can fall back to the most probable valid token or, as a last resort, relax the constraint and retry.
 
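The `vocab_strings` table used in the sketch above can be built once per tokenizer and reused for every step and every request; a possible helper (the name is illustrative):

```python
def build_vocab_strings(tokenizer):
    # Decode every vocabulary id once up front; the per-step mask then only
    # does string comparisons, which can be batched across the vocabulary.
    return [tokenizer.decode([i]) for i in range(len(tokenizer))]
```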
- Per step: O(|V|) constraint checking (vectorized), 1 model call
- Total: O(n · |V|) constraint checks and O(n) model calls for generating n tokens
- Memory: O(|V|) for storing token representations and masks
- Optimizations: KV caching reduces repeated computations, early termination reduces total steps (see the sketch after this list)
 
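For the KV-caching optimization, a Hugging Face causal LM (assumed here, matching the `gpt2` model loaded below) can return and reuse `past_key_values` so that each step only feeds the newly sampled token; a sketch:

```python
import torch

@torch.no_grad()
def next_token_logits(model, new_ids, past_key_values=None):
    # With a cache, only the newest token is fed forward; attention keys and
    # values for earlier positions are reused instead of recomputed.
    out = model(input_ids=new_ids, past_key_values=past_key_values, use_cache=True)
    return out.logits[:, -1, :], out.past_key_values
```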
Install dependencies:

```bash
uv sync
```

Run the main script:

```bash
uv run main.py
```

Or call it from Python:

```python
from main import ModelManager, character_prefix_sample
# Initialize and load model
model_manager = ModelManager("gpt2")
model_manager.load_model()
# Generate with character prefix constraint
result = character_prefix_sample(
    model_manager=model_manager,
    prompt_text="import",
    character_prefix="import num",
    max_new_tokens=15
)
print(result)  # Output: "import numpy as np"
```

The implementation includes comprehensive test cases demonstrating various scenarios (each shown as prompt → character prefix → completion; a test sketch follows the list):
- Simple prefix matching: `"import"` → `"import num"` → `"import numpy as np"`
- Mid-token completion: `"The model's behav"` → `"The model's behavi"` → `"The model's behavior"`
- F-string completion: `'print(f"The result is {re'` → `'print(f"The result is {res'` → `'print(f"The result is {result}"'`
- Empty prompt generation: `""` → `"Once upon a ti"` → `"Once upon a time"`
- JSON completion: `'{"data": {"user'` → `'{"data": {"username": "test'` → `'{"data": {"username": "test"}}'`
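A scenario like the first one could be checked end to end along these lines (a hypothetical pytest-style test using the API shown above; since the continuation is sampled, only the prefix guarantee is asserted):

```python
from main import ModelManager, character_prefix_sample

def test_simple_prefix_matching():
    model_manager = ModelManager("gpt2")
    model_manager.load_model()
    result = character_prefix_sample(
        model_manager=model_manager,
        prompt_text="import",
        character_prefix="import num",
        max_new_tokens=15,
    )
    # CPC's hard guarantee is the character prefix; the continuation
    # ("numpy as np", ...) is whatever the model samples after it.
    assert result.startswith("import num")
```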