Commit e728ab8

PeaBrane authored and dillon-cullinan committed
chore: pass in mocker engine args directly in python cli + default frontend port to 8000 (#2853)
Signed-off-by: PeaBrane <yanrpei@gmail.com>
1 parent f4d49b0 commit e728ab8

21 files changed: 309 additions and 166 deletions

components/backends/mocker/README.md

Lines changed: 39 additions & 9 deletions
@@ -7,17 +7,47 @@ The mocker engine is a mock vLLM implementation designed for testing and develop
 - Developing and debugging Dynamo components
 - Load testing and performance analysis

-**Basic usage:**
+## Basic usage

-The `--model-path` is required but can point to any valid model path - the mocker doesn't actually load the model weights (but the pre-processor needs the tokenizer). The arguments `block_size`, `num_gpu_blocks`, `max_num_seqs`, `max_num_batched_tokens`, `enable_prefix_caching`, and `enable_chunked_prefill` are common arguments shared with the real VLLM engine.
+The mocker engine now supports a vLLM-style CLI interface with individual arguments for all configuration options.

-And below are arguments that are mocker-specific:
-- `speedup_ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster.
-- `dp_size`: Number of data parallel workers to simulate (default: 1)
-- `watermark`: KV cache watermark threshold as a fraction (default: 0.01). This argument also exists for the real VLLM engine but cannot be passed as an engine arg.
+### Required arguments:
+- `--model-path`: Path to model directory or HuggingFace model ID (required for tokenizer)

+### MockEngineArgs parameters (vLLM-style):
+- `--num-gpu-blocks-override`: Number of GPU blocks for KV cache (default: 16384)
+- `--block-size`: Token block size for KV cache blocks (default: 64)
+- `--max-num-seqs`: Maximum number of sequences per iteration (default: 256)
+- `--max-num-batched-tokens`: Maximum number of batched tokens per iteration (default: 8192)
+- `--enable-prefix-caching` / `--no-enable-prefix-caching`: Enable/disable automatic prefix caching (default: True)
+- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`: Enable/disable chunked prefill (default: True)
+- `--watermark`: KV cache watermark threshold as a fraction (default: 0.01)
+- `--speedup-ratio`: Speed multiplier for token generation (default: 1.0). Higher values make the simulation engines run faster
+- `--data-parallel-size`: Number of data parallel workers to simulate (default: 1)
+
+### Example with individual arguments (vLLM-style):
 ```bash
-echo '{"speedup_ratio": 10.0}' > mocker_args.json
-python -m dynamo.mocker --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 --extra-engine-args mocker_args.json
+# Start mocker with custom configuration
+python -m dynamo.mocker \
+    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --num-gpu-blocks-override 8192 \
+    --block-size 16 \
+    --speedup-ratio 10.0 \
+    --max-num-seqs 512 \
+    --enable-prefix-caching
+
+# Start frontend server
 python -m dynamo.frontend --http-port 8080
-```
+```
+
+### Legacy JSON file support:
+For backward compatibility, you can still provide configuration via a JSON file:
+
+```bash
+echo '{"speedup_ratio": 10.0, "num_gpu_blocks": 8192}' > mocker_args.json
+python -m dynamo.mocker \
+    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --extra-engine-args mocker_args.json
+```
+
+Note: If `--extra-engine-args` is provided, it overrides all individual CLI arguments.
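
The precedence rule in the README note can be illustrated with a small standalone sketch. This is not the mocker's code: `resolve_engine_args` and its parameters are hypothetical names, and the sketch only mirrors the documented behavior that a JSON file passed via `--extra-engine-args` replaces the individual flags wholesale rather than being merged with them.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def resolve_engine_args(cli_args: dict, extra_engine_args: Optional[Path]) -> dict:
    """Return the effective engine args under the documented precedence rule.

    Hypothetical helper for illustration only.
    """
    if extra_engine_args is not None:
        # The JSON file wins outright; individual flags are ignored, not merged.
        return json.loads(extra_engine_args.read_text())
    # Otherwise keep only the flags the user actually set (None means "not passed").
    return {key: value for key, value in cli_args.items() if value is not None}


# Flags as argparse would expose them; max_num_seqs was not passed.
cli = {"speedup_ratio": 10.0, "block_size": 16, "max_num_seqs": None}

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "mocker_args.json"
    json_path.write_text(json.dumps({"speedup_ratio": 2.0}))

    from_flags = resolve_engine_args(cli, None)
    from_file = resolve_engine_args(cli, json_path)

print(from_flags)
print(from_file)
```

In the second call the `block_size` flag is dropped entirely, so combining both forms on one command line silently discards the flags.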

components/backends/mocker/src/dynamo/mocker/main.py

Lines changed: 163 additions & 18 deletions
@@ -1,10 +1,14 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

-# Usage: `python -m dynamo.mocker --model-path /data/models/Qwen3-0.6B-Q8_0.gguf --extra-engine-args args.json`
+# Usage: `python -m dynamo.mocker --model-path /data/models/Qwen3-0.6B-Q8_0.gguf`
+# Now supports vLLM-style individual arguments for MockEngineArgs

 import argparse
+import json
+import logging
 import os
+import tempfile
 from pathlib import Path

 import uvloop
@@ -19,35 +23,94 @@
 DEFAULT_ENDPOINT = f"dyn://{DYN_NAMESPACE}.backend.generate"

 configure_dynamo_logging()
+logger = logging.getLogger(__name__)
+
+
+def create_temp_engine_args_file(args) -> Path:
+    """
+    Create a temporary JSON file with MockEngineArgs from CLI arguments.
+    Returns the path to the temporary file.
+    """
+    engine_args = {}
+
+    # Only include non-None values that differ from defaults
+    # Note: argparse converts hyphens to underscores in attribute names
+    # Extract all potential engine arguments, using None as default for missing attributes
+    engine_args = {
+        "num_gpu_blocks": getattr(args, "num_gpu_blocks", None),
+        "block_size": getattr(args, "block_size", None),
+        "max_num_seqs": getattr(args, "max_num_seqs", None),
+        "max_num_batched_tokens": getattr(args, "max_num_batched_tokens", None),
+        "enable_prefix_caching": getattr(args, "enable_prefix_caching", None),
+        "enable_chunked_prefill": getattr(args, "enable_chunked_prefill", None),
+        "watermark": getattr(args, "watermark", None),
+        "speedup_ratio": getattr(args, "speedup_ratio", None),
+        "dp_size": getattr(args, "dp_size", None),
+        "startup_time": getattr(args, "startup_time", None),
+    }
+
+    # Remove None values to only include explicitly set arguments
+    engine_args = {k: v for k, v in engine_args.items() if v is not None}
+
+    # Create temporary file
+    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
+        json.dump(engine_args, f, indent=2)
+        temp_path = Path(f.name)
+
+    logger.debug(f"Created temporary MockEngineArgs file at {temp_path}")
+    logger.debug(f"MockEngineArgs: {engine_args}")
+
+    return temp_path


 @dynamo_worker(static=False)
 async def worker(runtime: DistributedRuntime):
     args = cmd_line_args()

-    # Create engine configuration
-    entrypoint_args = EntrypointArgs(
-        engine_type=EngineType.Mocker,
-        model_path=args.model_path,
-        model_name=args.model_name,
-        endpoint_id=args.endpoint,
-        extra_engine_args=args.extra_engine_args,
-    )
-
-    # Create and run the engine
-    # NOTE: only supports dyn endpoint for now
-    engine_config = await make_engine(runtime, entrypoint_args)
-    await run_input(runtime, args.endpoint, engine_config)
+    # Handle extra_engine_args: either use provided file or create from CLI args
+    if args.extra_engine_args:
+        # User provided explicit JSON file
+        extra_engine_args_path = args.extra_engine_args
+        logger.info(f"Using provided MockEngineArgs from {extra_engine_args_path}")
+    else:
+        # Create temporary JSON file from CLI arguments
+        extra_engine_args_path = create_temp_engine_args_file(args)
+        logger.info("Created MockEngineArgs from CLI arguments")
+
+    try:
+        # Create engine configuration
+        entrypoint_args = EntrypointArgs(
+            engine_type=EngineType.Mocker,
+            model_path=args.model_path,
+            model_name=args.model_name,
+            endpoint_id=args.endpoint,
+            extra_engine_args=extra_engine_args_path,
+        )
+
+        # Create and run the engine
+        # NOTE: only supports dyn endpoint for now
+        engine_config = await make_engine(runtime, entrypoint_args)
+        await run_input(runtime, args.endpoint, engine_config)
+    finally:
+        # Clean up temporary file if we created one
+        if not args.extra_engine_args and extra_engine_args_path.exists():
+            try:
+                extra_engine_args_path.unlink()
+                logger.debug(f"Cleaned up temporary file {extra_engine_args_path}")
+            except Exception as e:
+                logger.warning(f"Failed to clean up temporary file: {e}")


 def cmd_line_args():
     parser = argparse.ArgumentParser(
-        description="Mocker engine for testing Dynamo LLM infrastructure.",
+        description="Mocker engine for testing Dynamo LLM infrastructure with vLLM-style CLI.",
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
     parser.add_argument(
         "--version", action="version", version=f"Dynamo Mocker {__version__}"
     )
+
+    # Basic configuration
     parser.add_argument(
         "--model-path",
         type=str,
@@ -63,13 +126,95 @@ def cmd_line_args():
         "--model-name",
         type=str,
         default=None,
-        help="Model name for API responses (default: mocker-engine)",
+        help="Model name for API responses (default: derived from model-path)",
     )
+
+    # MockEngineArgs parameters (similar to vLLM style)
+    parser.add_argument(
+        "--num-gpu-blocks-override",
+        type=int,
+        dest="num_gpu_blocks",  # Maps to num_gpu_blocks in MockEngineArgs
+        default=None,
+        help="Number of GPU blocks for KV cache (default: 16384)",
+    )
+    parser.add_argument(
+        "--block-size",
+        type=int,
+        default=None,
+        help="Token block size for KV cache blocks (default: 64)",
+    )
+    parser.add_argument(
+        "--max-num-seqs",
+        type=int,
+        default=None,
+        help="Maximum number of sequences per iteration (default: 256)",
+    )
+    parser.add_argument(
+        "--max-num-batched-tokens",
+        type=int,
+        default=None,
+        help="Maximum number of batched tokens per iteration (default: 8192)",
+    )
+    parser.add_argument(
+        "--enable-prefix-caching",
+        action="store_true",
+        dest="enable_prefix_caching",
+        default=None,
+        help="Enable automatic prefix caching (default: True)",
+    )
+    parser.add_argument(
+        "--no-enable-prefix-caching",
+        action="store_false",
+        dest="enable_prefix_caching",
+        default=None,
+        help="Disable automatic prefix caching",
+    )
+    parser.add_argument(
+        "--enable-chunked-prefill",
+        action="store_true",
+        dest="enable_chunked_prefill",
+        default=None,
+        help="Enable chunked prefill (default: True)",
+    )
+    parser.add_argument(
+        "--no-enable-chunked-prefill",
+        action="store_false",
+        dest="enable_chunked_prefill",
+        default=None,
+        help="Disable chunked prefill",
+    )
+    parser.add_argument(
+        "--watermark",
+        type=float,
+        default=None,
+        help="Watermark value for the mocker engine (default: 0.01)",
+    )
+    parser.add_argument(
+        "--speedup-ratio",
+        type=float,
+        default=None,
+        help="Speedup ratio for mock execution (default: 1.0)",
+    )
+    parser.add_argument(
+        "--data-parallel-size",
+        type=int,
+        dest="dp_size",
+        default=None,
+        help="Number of data parallel replicas (default: 1)",
+    )
+    parser.add_argument(
+        "--startup-time",
+        type=float,
+        default=None,
+        help="Simulated engine startup time in seconds (default: None)",
+    )
+
+    # Legacy support - allow direct JSON file specification
     parser.add_argument(
         "--extra-engine-args",
         type=Path,
-        help="Path to JSON file with mocker configuration "
-        "(num_gpu_blocks, speedup_ratio, etc.)",
+        help="Path to JSON file with mocker configuration. "
+        "If provided, overrides individual CLI arguments.",
     )

     return parser.parse_args()
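
The paired `--enable-*` / `--no-enable-*` flags in this file share one `dest` and default to `None`, which yields three states: explicitly on, explicitly off, and unset (left to the engine default). The sketch below isolates that pattern in a reduced stand-in parser; the builder function names are hypothetical. As an alternative, `argparse.BooleanOptionalAction` (Python 3.9+) generates the same flag pair from a single `add_argument` call; the commit does not use it, it is shown only for comparison.

```python
import argparse


def build_paired_parser() -> argparse.ArgumentParser:
    """Two flags writing to one dest, as in the diff above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-prefix-caching", action="store_true",
                        dest="enable_prefix_caching", default=None)
    parser.add_argument("--no-enable-prefix-caching", action="store_false",
                        dest="enable_prefix_caching", default=None)
    return parser


def build_optional_action_parser() -> argparse.ArgumentParser:
    """Alternative: one call that registers both --flag and --no-flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-prefix-caching",
                        action=argparse.BooleanOptionalAction, default=None)
    return parser


# The three command lines a user could type for this option.
command_lines = ([], ["--enable-prefix-caching"], ["--no-enable-prefix-caching"])

paired_states = [
    build_paired_parser().parse_args(argv).enable_prefix_caching
    for argv in command_lines
]
optional_states = [
    build_optional_action_parser().parse_args(argv).enable_prefix_caching
    for argv in command_lines
]

print(paired_states)
print(optional_states)
```

Keeping `None` distinct from `False` is what lets `create_temp_engine_args_file` omit unset options from the JSON it writes.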

components/frontend/src/dynamo/frontend/main.py

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ def parse_args():
     parser.add_argument(
         "--http-port",
         type=int,
-        default=int(os.environ.get("DYN_HTTP_PORT", "8080")),
+        default=int(os.environ.get("DYN_HTTP_PORT", "8000")),
         help="HTTP port for the engine (u16). Can be set via DYN_HTTP_PORT env var.",
     )
     parser.add_argument(
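
The changed line gives the port three sources, in descending priority: the `--http-port` flag, the `DYN_HTTP_PORT` environment variable, and the new built-in default of 8000. A minimal sketch of that resolution order follows; `resolve_http_port` is a hypothetical wrapper, and the environment is passed in as a plain mapping so the example does not depend on the caller's shell.

```python
import argparse
import os


def resolve_http_port(argv, environ=None):
    """Resolve the port the same way the diff's add_argument call does."""
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--http-port",
        type=int,
        # The env var is read once, when the parser is built; 8000 is the fallback.
        default=int(env.get("DYN_HTTP_PORT", "8000")),
    )
    return parser.parse_args(argv).http_port


builtin_default = resolve_http_port([], {})
from_env = resolve_http_port([], {"DYN_HTTP_PORT": "9001"})
from_flag = resolve_http_port(["--http-port", "8080"], {"DYN_HTTP_PORT": "9001"})

print(builtin_default, from_env, from_flag)
```

Scripts that relied on the old implicit 8080 would need either the flag or the environment variable after this change.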

lib/llm/src/mocker/engine.rs

Lines changed: 7 additions & 12 deletions
@@ -1,17 +1,5 @@
 // SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.

 //! MockSchedulerEngine - AsyncEngine wrapper around the Scheduler
 //!
@@ -76,6 +64,13 @@ impl MockVllmEngine {
     pub async fn start(&self, component: Component) -> Result<()> {
         let cancel_token = component.drt().runtime().child_token();

+        // Simulate engine startup time if configured
+        if let Some(startup_time_secs) = self.engine_args.startup_time {
+            tracing::info!("Simulating engine startup time: {:.2}s", startup_time_secs);
+            tokio::time::sleep(Duration::from_secs_f64(startup_time_secs)).await;
+            tracing::info!("Engine startup simulation completed");
+        }
+
         let (schedulers, kv_event_receiver) = self.start_schedulers(
             self.engine_args.clone(),
             self.active_requests.clone(),
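
Rust's `Duration::from_secs_f64` panics when given a negative or non-finite value, while the Python CLI in this commit declares `--startup-time` as a bare `float`. Whether the value is validated anywhere between the two is not visible in this diff. As one possible safeguard, the sketch below rejects bad values at parse time; `non_negative_seconds` is a hypothetical helper, not part of the commit.

```python
import argparse
import math


def non_negative_seconds(text: str) -> float:
    """argparse type callable: accept only finite, non-negative floats."""
    try:
        value = float(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a number: {text!r}")
    if not math.isfinite(value) or value < 0:
        raise argparse.ArgumentTypeError(
            f"must be a finite, non-negative number of seconds: {text!r}"
        )
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--startup-time", type=non_negative_seconds, default=None)

args = parser.parse_args(["--startup-time", "2.5"])
print(args.startup_time)
```

With this type in place, argparse reports an invalid value as a usage error before any engine code runs.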

lib/llm/src/mocker/evictor.rs

Lines changed: 0 additions & 12 deletions
@@ -1,17 +1,5 @@
 // SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.

 use std::cmp::{Eq, Ordering};
 use std::collections::{BTreeSet, HashMap};

lib/llm/src/mocker/kv_manager.rs

Lines changed: 0 additions & 12 deletions
@@ -1,17 +1,5 @@
 // SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.

 //! # KV Manager
 //! A synchronous implementation of a block manager that handles MoveBlock signals for caching KV blocks.
