Commit e330d96

feat: enable / disable chunked prefill for mockers (#2015)
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
1 parent 353146e commit e330d96

6 files changed, with 167 additions and 72 deletions

components/backends/mocker/README.md

Lines changed: 1 addition & 3 deletions
@@ -9,15 +9,13 @@ The mocker engine is a mock vLLM implementation designed for testing and develop

 **Basic usage:**

-The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block-size`, `num-gpu-blocks`, `max-num-seqs`, `max-num-batched-tokens`, and `enable-prefix-caching` are common arguments shared with the real VLLM engine.
+The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real VLLM engine.

 And below are arguments that are mocker-specific:
 - `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster.
 - `dp_size`: Number of data parallel workers to simulate (default: 1)
 - `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real VLLM engine but cannot be passed as an engine arg.

->[!NOTE]
->Currently, `enable_chunked_prefill` is always assumed to be false, which mirrors the vllm v0 behavior. This is also the current behavior in `examples/llm`. This will be updated in the near future as we move to support vllm v1 (and deprecate support for vllm v0).
 ```bash
 echo '{"speedup_ratio": 10.0}' > mocker_args.json
 python -m dynamo.mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json

docs/guides/dynamo_run.md

Lines changed: 1 addition & 3 deletions
@@ -549,15 +549,13 @@ The mocker engine is a mock vLLM implementation designed for testing and develop

 **Basic usage:**

-The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights. The arguments `block-size`, `num-gpu-blocks`, `max-num-seqs`, `max-num-batched-tokens`, and `enable-prefix-caching` are common arguments shared with the real VLLM engine.
+The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real VLLM engine.

 And below are arguments that are mocker-specific:
 - `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster.
 - `dp_size`: Number of data parallel workers to simulate (default: 1)
 - `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real VLLM engine but cannot be passed as an engine arg.

->[!NOTE]
->Currently, `enable_chunked_prefill` is always assumed to be false, which mirrors the vllm v0 behavior. This is also the current behavior in `examples/llm`. This will be updated in the near future as we move to support vllm v1 (and deprecate support for vllm v0).
 ```bash
 echo '{"speedup_ratio": 10.0}' > mocker_args.json
 dynamo-run in=dyn://dynamo.mocker.generate out=mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json

lib/llm/src/mocker/kv_manager.rs

Lines changed: 0 additions & 5 deletions
@@ -293,14 +293,9 @@ impl KvManager {
         let overlap_blocks = seq_blocks.len() - new_blocks;
         let new_tokens = sequence.num_input_tokens() - overlap_blocks * self.block_size;

-        // Calculate prefill compute
-        let prefill_compute =
-            1.25e-6 * (new_tokens as f64).powi(2) + 7.41e-2 * (new_tokens as f64) + 2.62e1;
-
         PrefillCost {
             new_blocks,
             new_tokens,
-            prefill_compute,
         }
     }
 }
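
The hunk above drops the eagerly computed `prefill_compute` field. The same polynomial reappears in `lib/llm/src/mocker/protocols.rs` as the method `PrefillCost::predict_prefill_compute`, which takes an optional token count. The sketch below is a minimal, self-contained illustration of that method. The struct and the polynomial are copied from this commit. `chunk_costs` is a hypothetical helper that is not part of the commit, and it assumes that chunked prefill splits the new tokens by a fixed token budget. The scheduler that actually calls `predict_prefill_compute` is not shown in this excerpt, so the chunking logic is an assumption for illustration only.

```rust
/// Mirrors the `PrefillCost` struct as it looks after this commit.
pub struct PrefillCost {
    pub new_blocks: usize,
    pub new_tokens: usize,
}

impl PrefillCost {
    /// Same polynomial as in the commit. `None` falls back to `self.new_tokens`;
    /// `Some(n)` evaluates the cost for `n` tokens instead.
    pub fn predict_prefill_compute(&self, new_tokens: Option<usize>) -> f64 {
        let tokens = new_tokens.unwrap_or(self.new_tokens);
        1.25e-6 * (tokens as f64).powi(2) + 7.41e-2 * (tokens as f64) + 2.62e1
    }
}

/// Hypothetical helper (NOT in the commit): split `new_tokens` into chunks of
/// at most `budget` tokens and return the predicted cost of each chunk.
pub fn chunk_costs(cost: &PrefillCost, budget: usize) -> Vec<f64> {
    assert!(budget > 0, "token budget must be positive");
    let mut remaining = cost.new_tokens;
    let mut costs = Vec::new();
    while remaining > 0 {
        let chunk = remaining.min(budget);
        costs.push(cost.predict_prefill_compute(Some(chunk)));
        remaining -= chunk;
    }
    costs
}
```

For a `PrefillCost` with `new_tokens = 1000`, calling `predict_prefill_compute(None)` evaluates the polynomial at 1000 tokens, while `predict_prefill_compute(Some(500))` evaluates it for a 500-token chunk. The optional argument lets a caller cost a partial prefill without rebuilding the struct.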

lib/llm/src/mocker/protocols.rs

Lines changed: 17 additions & 1 deletion
@@ -58,7 +58,13 @@ pub struct DirectRequest {
 pub struct PrefillCost {
     pub new_blocks: usize,
     pub new_tokens: usize,
-    pub prefill_compute: f64,
+}
+
+impl PrefillCost {
+    pub fn predict_prefill_compute(&self, new_tokens: Option<usize>) -> f64 {
+        let tokens = new_tokens.unwrap_or(self.new_tokens);
+        1.25e-6 * (tokens as f64).powi(2) + 7.41e-2 * (tokens as f64) + 2.62e1
+    }
 }

 /// Signal for output token generation with completion status
@@ -89,6 +95,9 @@ pub struct MockEngineArgs {
     #[builder(default = true)]
     pub enable_prefix_caching: bool,

+    #[builder(default = true)]
+    pub enable_chunked_prefill: bool,
+
     #[builder(default = "0.01")]
     pub watermark: f64,

@@ -127,6 +136,7 @@ impl MockEngineArgs {
             "max_num_seqs",
             "max_num_batched_tokens",
             "enable_prefix_caching",
+            "enable_chunked_prefill",
             "watermark",
             "speedup_ratio",
             "dp_size",
@@ -181,6 +191,12 @@ impl MockEngineArgs {
             }
         }

+        if let Some(value) = extra_args.get("enable_chunked_prefill") {
+            if let Some(enabled) = value.as_bool() {
+                builder = builder.enable_chunked_prefill(enabled);
+            }
+        }
+
         if let Some(value) = extra_args.get("watermark") {
             if let Some(num) = value.as_f64() {
                 builder = builder.watermark(num);
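
One consequence of the parsing code in this diff is worth spelling out: `enable_chunked_prefill` defaults to `true`, and the override is applied only when the supplied value is a real boolean. A value of another type, such as the string `"false"`, falls through `as_bool()` and leaves the default in place. The sketch below reproduces that control flow using only the standard library. The `Value` enum is a stand-in for whatever JSON value type the real code uses (an assumption, since the type is not shown in this excerpt), and `resolve_chunked_prefill` is a hypothetical helper, not a function from the commit.

```rust
use std::collections::HashMap;

/// Stand-in for a JSON value type (assumption: the real code uses a JSON
/// library's value type whose `as_bool` behaves like this one).
pub enum Value {
    Bool(bool),
    Number(f64),
    Text(String),
}

impl Value {
    /// Returns `Some` only for genuine booleans, `None` for everything else.
    pub fn as_bool(&self) -> Option<bool> {
        match self {
            Value::Bool(b) => Some(*b),
            _ => None,
        }
    }
}

/// Hypothetical helper (NOT in the commit) that mirrors the diff's logic:
/// start from the builder default and override only on a boolean value.
pub fn resolve_chunked_prefill(extra_args: &HashMap<String, Value>) -> bool {
    let mut enabled = true; // builder default shown in the diff
    if let Some(value) = extra_args.get("enable_chunked_prefill") {
        if let Some(flag) = value.as_bool() {
            enabled = flag;
        }
    }
    enabled
}
```

Under these assumptions, an args file containing `{"enable_chunked_prefill": false}` disables chunked prefill, while `{"enable_chunked_prefill": "false"}` would be ignored and chunked prefill would stay enabled.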
