A minimal, efficient implementation of character prefix conditioning (CPC) for code completion, inspired by the Cursor blog.
When using a language model for code completion, we typically want the model to produce a completion that begins with what the user has typed. However, modern language models operate on sequences of tokens, not characters, so naively tokenizing the user's input and sending it to the model can produce poor completions when the cursor doesn't lie on a token boundary: the model ends up conditioned on a token split it rarely saw during training.
CPC is an algorithm for sampling a sequence of tokens conditioned on a character prefix, ensuring completions always start with the user's typed prefix—even if it doesn't align with token boundaries.
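To make the mismatch concrete, here is a small illustration using the Hugging Face `transformers` GPT-2 tokenizer (the same model family the usage example below loads); the exact token splits are tokenizer-specific:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

typed = "The model's behav"      # characters at the user's cursor
full = "The model's behavior"    # text we would like the completion to form

print([tok.decode([i]) for i in tok.encode(typed)])
print([tok.decode([i]) for i in tok.encode(full)])
# Whenever the cursor falls inside a token, the two token sequences diverge at
# the cut point, so naively continuing from the typed text's tokens conditions
# the model on a token split it rarely saw during training.
```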
We want to sample a sequence of tokens $t_1, \ldots, t_n$ from an autoregressive language model $p$, subject to the constraint that the decoded string $\text{repr}(t_1) + \cdots + \text{repr}(t_n)$ starts with the user's character prefix $P$. We define $\text{repr}(t)$ as the string of characters that token $t$ decodes to. For each step $k$, the algorithm proceeds as follows (a minimal code sketch of one such step follows the list):

- Get model predictions: Compute $p(t_k \mid t_1, \ldots, t_{k-1})$ from the language model for all possible tokens $t_k$.
- Apply constraint mask: For each token $t_k$, check whether appending it to the current sequence keeps the character prefix constraint $P$ satisfiable. Create a binary mask $M(t_k)$ where:
  - $M(t_k) = 1$ if $\text{repr}(t_1) + \cdots + \text{repr}(t_{k-1}) + \text{repr}(t_k)$ is consistent with $P$ (it starts with $P$, or is itself a prefix of $P$)
  - $M(t_k) = 0$ otherwise
- Renormalize probabilities: Compute the constrained distribution:
  $$q(t_k) = \frac{p(t_k) \cdot M(t_k)}{\sum_{t'} p(t') \cdot M(t')}$$
- Sample from the constrained distribution: Sample $t_k \sim q(t_k)$.
- Terminate when the constraint is satisfied: Once $\text{repr}(t_1) + \cdots + \text{repr}(t_k)$ starts with the prefix $P$, the constrained phase ends and generation continues normally.
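Below is a minimal sketch of one such constrained step, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits`; the function name, the precomputed `vocab_strings` table, and the `remaining_prefix` bookkeeping are illustrative, not the repository's exact API:

```python
import torch

@torch.no_grad()
def constrained_step(model, vocab_strings, input_ids, remaining_prefix):
    """One CPC step: mask inconsistent tokens, renormalize, sample t_k.

    `vocab_strings[i]` is the decoded string of vocabulary id i (precomputed once);
    `remaining_prefix` is the part of the user's prefix not yet covered by
    previously generated tokens ("" once the constraint is satisfied).
    """
    logits = model(input_ids=input_ids).logits[0, -1]   # scores for t_k
    probs = torch.softmax(logits, dim=-1)                # p(t_k | t_1..t_{k-1})

    # M(t_k) = 1 iff the token's characters stay consistent with the prefix:
    # the token string starts with the remaining prefix, or is a prefix of it.
    mask = torch.tensor(
        [s.startswith(remaining_prefix) or remaining_prefix.startswith(s)
         for s in vocab_strings],
        dtype=probs.dtype, device=probs.device,
    )

    q = probs * mask
    if q.sum() == 0:                   # no consistent token: fallback/retry case
        raise RuntimeError("no token consistent with the remaining prefix")
    q = q / q.sum()                    # q(t_k) = p(t_k)*M(t_k) / sum_t' p(t')*M(t')
    next_id = int(torch.multinomial(q, num_samples=1))

    consumed = min(len(remaining_prefix), len(vocab_strings[next_id]))
    return next_id, remaining_prefix[consumed:]
```

Note that when `remaining_prefix` is empty every token passes the mask, so the same step reduces to ordinary sampling; this is the early-termination behavior described below.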
- Efficiency: The algorithm requires only one forward pass through the language model per generated token, minimizing model calls.
- Vectorization: The constraint checking (for all possible next tokens) is vectorized across the vocabulary, making it efficient despite being O(|V|) per step (see the sketch after this list).
- Early termination: Constrained generation can stop once the prefix constraint is satisfied, then continue normally.
- Fallback strategies: For edge cases where no valid token has sufficient probability, the algorithm can fall back to the most probable valid token or, as a last resort, relax the constraint and retry.
 
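The `vocab_strings` table used in the sketch above can be built once per tokenizer and reused for every step and every request; a possible helper (the name is illustrative):

```python
def build_vocab_strings(tokenizer):
    # Decode every vocabulary id once up front; the per-step mask then only
    # does string comparisons, which can be batched across the vocabulary.
    return [tokenizer.decode([i]) for i in range(len(tokenizer))]
```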
- Per step: O(|V|) constraint checking (vectorized), 1 model call
- Total: O(n · |V|) constraint checks and O(n) model calls for generating n tokens
- Memory: O(|V|) for storing token representations and masks
- Optimizations: KV caching reduces repeated computations, early termination reduces total steps (see the sketch after this list)
 
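For the KV-caching optimization, a Hugging Face causal LM (assumed here, matching the `gpt2` model loaded below) can return and reuse `past_key_values` so that each step only feeds the newly sampled token; a sketch:

```python
import torch

@torch.no_grad()
def next_token_logits(model, new_ids, past_key_values=None):
    # With a cache, only the newest token is fed forward; attention keys and
    # values for earlier positions are reused instead of recomputed.
    out = model(input_ids=new_ids, past_key_values=past_key_values, use_cache=True)
    return out.logits[:, -1, :], out.past_key_values
```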
Install dependencies:

```bash
uv sync
```

Run the main script:

```bash
uv run main.py
```

Or call it from Python:

```python
from main import ModelManager, character_prefix_sample
# Initialize and load model
model_manager = ModelManager("gpt2")
model_manager.load_model()
# Generate with character prefix constraint
result = character_prefix_sample(
    model_manager=model_manager,
    prompt_text="import",
    character_prefix="import num",
    max_new_tokens=15
)
print(result)  # Output: "import numpy as np"
```

The implementation includes comprehensive test cases demonstrating various scenarios (each shown as prompt → character prefix → completion; a test sketch follows the list):
- Simple prefix matching: `"import"` → `"import num"` → `"import numpy as np"`
- Mid-token completion: `"The model's behav"` → `"The model's behavi"` → `"The model's behavior"`
- F-string completion: `'print(f"The result is {re'` → `'print(f"The result is {res'` → `'print(f"The result is {result}"'`
- Empty prompt generation: `""` → `"Once upon a ti"` → `"Once upon a time"`
- JSON completion: `'{"data": {"user'` → `'{"data": {"username": "test'` → `'{"data": {"username": "test"}}'`
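A scenario like the first one could be checked end to end along these lines (a hypothetical pytest-style test using the API shown above; since the continuation is sampled, only the prefix guarantee is asserted):

```python
from main import ModelManager, character_prefix_sample

def test_simple_prefix_matching():
    model_manager = ModelManager("gpt2")
    model_manager.load_model()
    result = character_prefix_sample(
        model_manager=model_manager,
        prompt_text="import",
        character_prefix="import num",
        max_new_tokens=15,
    )
    # CPC's hard guarantee is the character prefix; the continuation
    # ("numpy as np", ...) is whatever the model samples after it.
    assert result.startswith("import num")
```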