feat: vllm prefill router #3155
Conversation
Signed-off-by: PeaBrane <yanrpei@gmail.com>
Walkthrough

Default benchmark ports switched to 8000. Router launch scripts parameterized and expanded for prefill/decode workers. Added a Prefill Router service and integrated router-aware prefill selection in the decode path. Introduced KV event publisher setup in vLLM main. Updated Rust KV router/scheduler APIs and Python bindings, including best_worker_id signature changes.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant DecodeHandler
    participant PrefillRouter
    participant PrefillWorker
    participant RR as RoundRobin
    Note over DecodeHandler: New: optional router-aware prefill selection
    Client->>DecodeHandler: generate(request with prompt)
    alt Prefill router available and has instances
        DecodeHandler->>PrefillRouter: best_worker_id(token_ids)
        alt worker_id found
            PrefillRouter-->>DecodeHandler: worker_id, overlap_blocks
            DecodeHandler->>PrefillWorker: prefill.direct(request, worker_id)
            PrefillWorker-->>DecodeHandler: prefill result
        else no decision / error
            PrefillRouter--x DecodeHandler: fallback
            DecodeHandler->>RR: prefill via round_robin
            RR-->>DecodeHandler: prefill result
        end
    else Router unavailable
        DecodeHandler->>RR: prefill via round_robin
        RR-->>DecodeHandler: prefill result
    end
    DecodeHandler-->>Client: streamed generation
```
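The fallback policy in the diagram (prefer the KV router's pick; fall back to round-robin when the router is missing, returns no decision, or errors) can be sketched as a small standalone policy. The class and method names here are illustrative, not project code:

```python
import itertools


class PrefillSelector:
    """Sketch of router-aware prefill selection with round-robin fallback."""

    def __init__(self, worker_ids, router_lookup=None):
        # worker_ids is assumed non-empty; router_lookup is an optional
        # callable: token_ids -> (worker_id, overlap_blocks) or None.
        self.worker_ids = list(worker_ids)
        self.router_lookup = router_lookup
        self._rr = itertools.cycle(self.worker_ids)

    def select(self, token_ids):
        if self.router_lookup is not None:
            try:
                decision = self.router_lookup(token_ids)
                if decision is not None:
                    worker_id, _overlap_blocks = decision
                    return worker_id, "router"
            except Exception:
                pass  # router error: fall through to round-robin
        return next(self._rr), "round_robin"
```

The point of keeping the round-robin cycle inside the selector is that the fallback path stays deterministic and fair even while the router is flapping.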
```mermaid
sequenceDiagram
    autonumber
    participant Main as vLLM Main
    participant Engine as vLLM Engine Setup
    participant Handler as Worker Handler
    participant KV as KV Event Publisher
    Note over Main: New: setup_kv_event_publisher()
    Main->>Engine: setup_vllm_engine()
    Engine-->>Main: engine_client, vllm_config, default_sampling_params
    Main->>Main: setup_kv_event_publisher(config, component, generate_endpoint, vllm_config)
    alt KV enabled
        Main-->>KV: create ZmqKvEventPublisher
        Main->>Handler: attach kv_publisher
    else KV disabled
        Main-->>Handler: no publisher attached
    end
    Main-->>Handler: start serving
```
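The config-gated setup in this diagram reduces to a simple pattern: construct a publisher only when KV events are enabled, and let the handler carry an optional publisher. A minimal sketch with stand-in names (`KvConfig` and `make_publisher` are hypothetical, not the real dynamo objects):

```python
class KvConfig:
    """Toy config standing in for the real vLLM/Dynamo config objects."""

    def __init__(self, kv_events_enabled):
        self.kv_events_enabled = kv_events_enabled


def setup_kv_event_publisher(config, make_publisher):
    """Return a publisher only when KV events are enabled, else None.

    make_publisher stands in for constructing ZmqKvEventPublisher with the
    component/endpoint/lease wiring shown in the diagram.
    """
    if not config.kv_events_enabled:
        return None
    return make_publisher(config)


# The handler then simply carries an optional publisher:
publisher = setup_kv_event_publisher(KvConfig(True), lambda cfg: "zmq-publisher")
```

Returning `None` rather than a no-op publisher keeps the disabled path visible at the call site, matching the "no publisher attached" branch above.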
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks

✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/kv_router/scheduler.rs (1)
389-458: Fix RNG API usage and softmax normalization (compile/runtime bugs).

`rand = "0.9.0"` found at lib/async-openai/Cargo.toml:41 — switch to `thread_rng()`/`gen_range()`/`gen()` and use proper min-max normalization `(v - min) / (max - min)` before negation.

File: lib/llm/src/kv_router/scheduler.rs Lines: 389-458

```diff
-        let mut rng = rand::rng();
-        let index = rng.random_range(0..min_keys.len());
+        let mut rng = rand::thread_rng();
+        let index = rng.gen_range(0..min_keys.len());
@@
-        let mut rng = rand::rng();
-        let sample: f64 = rng.random();
+        let mut rng = rand::thread_rng();
+        let sample: f64 = rng.gen();
@@
-        let normalized: Vec<_> = values
-            .iter()
-            .map(|&v| {
-                // Lower is better, so negate
-                // Note we don't need to do actual min-max norm here, just off by an offset
-                let norm = v / (max_val - min_val);
-                -norm
-            })
-            .collect();
+        let normalized: Vec<_> = values
+            .iter()
+            .map(|&v| {
+                // Lower is better, so negate after proper min-max normalization
+                let norm = (v - min_val) / (max_val - min_val);
+                -norm
+            })
+            .collect();
```
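On the normalization half of this suggestion: softmax is shift-invariant, so for pure softmax sampling the offset-free form and the full min-max form yield identical probabilities; the min-max form mainly keeps the logits in a predictable [-1, 0] range. A standalone Python check (not project code):

```python
import math


def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


values = [3.0, 7.0, 11.0]  # e.g. per-worker load; lower is better
lo, hi = min(values), max(values)
with_offset = [-(v - lo) / (hi - lo) for v in values]  # proper min-max, negated
without_offset = [-v / (hi - lo) for v in values]      # the original form

p1, p2 = softmax(with_offset), softmax(without_offset)
# The two forms differ only by a constant shift, so softmax agrees:
assert all(abs(a - b) < 1e-12 for a, b in zip(p1, p2))
# Lower value gets higher selection probability either way:
assert p1[0] > p1[1] > p1[2]
```

So the fix here is about readability and keeping the normalized values in a bounded range, not about correcting the sampled distribution itself.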
🧹 Nitpick comments (12)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1)
4-12: Consider more specific exception handling for version resolution.

While the fallback version resolution is resilient, catching all exceptions can mask unexpected errors during development.
Apply this diff to use more specific exception types:
```diff
 try:
     from ._version import __version__
-except Exception:
+except ImportError:
     try:
         from importlib.metadata import version as _pkg_version

         __version__ = _pkg_version("ai-dynamo")
-    except Exception:
+    except (ImportError, ModuleNotFoundError, KeyError):
         __version__ = "0.0.0+unknown"
```

components/backends/vllm/launch/agg_router.sh (1)
10-12: Consider parameterizing model and block size configuration.

While centralizing these values improves maintainability, consider making them configurable via environment variables or command-line arguments for flexibility across different deployments.
Apply this diff to make the configuration more flexible:
```diff
 # Common configuration
-MODEL="Qwen/Qwen3-0.6B"
-BLOCK_SIZE=64
+MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+BLOCK_SIZE="${BLOCK_SIZE:-64}"
```

components/backends/vllm/src/dynamo/vllm/handlers.py (2)
13-13: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.
Run isort to fix the import order:
```shell
isort components/backends/vllm/src/dynamo/vllm/handlers.py
```
108-123: Add error recovery mechanism for prefill availability check.

The background task logs errors but doesn't implement any recovery strategy. Consider adding exponential backoff or a circuit breaker pattern for resilience.
Apply this diff to add basic exponential backoff:
```diff
     async def _prefill_check_loop(self):
         """Background task that checks prefill worker availability every 5 seconds."""
+        backoff = 5  # Initial backoff in seconds
+        max_backoff = 60  # Maximum backoff in seconds
         while True:
             try:
                 if self.prefill_worker_client is not None:
                     self.can_prefill = len(self.prefill_worker_client.instance_ids())
                     logger.debug(f"Current Prefill Workers: {self.can_prefill}")
                 else:
                     self.can_prefill = 0
+                backoff = 5  # Reset backoff on success
             except asyncio.CancelledError:
                 logger.warning("Prefill check loop cancelled.")
                 raise
             except Exception as e:
                 logger.error(f"Error in prefill check loop: {e}")
+                backoff = min(backoff * 2, max_backoff)  # Exponential backoff
+                logger.debug(f"Backing off for {backoff} seconds")
-            await asyncio.sleep(5)
+            await asyncio.sleep(backoff)
```

components/backends/vllm/src/dynamo/vllm/main.py (1)
8-8: Fix import sorting issue flagged by pipeline.

The pipeline indicates that imports need to be sorted according to isort rules.
Run isort to fix the import order:
```shell
isort components/backends/vllm/src/dynamo/vllm/main.py
```

lib/llm/src/kv_router/scheduler.rs (2)
524-529: Reduce hot-path log level.

Per-request formula logs at INFO will spam logs. Use DEBUG or TRACE.
```diff
-            tracing::info!(
+            tracing::debug!(
                 "Formula for {worker_id} with {overlap} cached blocks: {logit:.3} \
                  = {overlap_weight:.1} * prefill_blocks + decode_blocks \
                  = {overlap_weight:.1} * {potential_prefill_block:.3} + {decode_block:.3}"
             );
```
42-43: Typo in error string.

```diff
-    #[error("no endpoints aviailable to route work")]
+    #[error("no endpoints available to route work")]
```
68-71: Log full traceback on init failure.

Use `logger.exception(...)` to capture the stack trace.

```diff
-        except Exception as e:
-            logger.error(f"Failed to initialize KvPushRouter: {e}")
+        except Exception:
+            logger.exception("Failed to initialize KvPushRouter")
             raise
```
72-87: Unused `context` arg; rename to `_context` to satisfy linters.

```diff
-    async def best_worker_id(self, request, context):
+    async def best_worker_id(self, request, _context):
```
127-133: Catch broad exceptions with traceback.

Switch to `logger.exception` to preserve details. Consider narrowing the exception types later.

```diff
-        except Exception as e:
-            logger.error(f"Error finding best worker: {e}")
+        except Exception:
+            logger.exception("Error finding best worker")
             yield {
                 "status": "error",
-                "message": str(e),
+                "message": "internal error",
             }
```
198-199: Also log traceback on serve failure.

```diff
-    except Exception as e:
-        logger.error(f"Failed to serve endpoint: {e}")
+    except Exception:
+        logger.exception("Failed to serve endpoint")
```
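For context on why the suggestions above prefer `logger.exception`: inside an `except` block it logs the active exception's traceback automatically, while `logger.error` records only the message. A quick standalone demonstration:

```python
import io
import logging

buf = io.StringIO()
logging.basicConfig(stream=buf, level=logging.ERROR, force=True)
log = logging.getLogger("demo")

try:
    raise ValueError("boom")
except ValueError:
    log.error("plain error")  # message only, no traceback

try:
    raise ValueError("boom")
except ValueError:
    log.exception("with traceback")  # message plus the full traceback

out = buf.getvalue()
assert "Traceback" in out
# Only the .exception() call captured the exception itself:
assert out.count("ValueError: boom") == 1
```

`logger.exception` is equivalent to `logger.error(..., exc_info=True)`, so the same effect is available at other levels via the `exc_info` keyword.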
845-885: Nice: centralize request→stream conversion.

Reduces duplication and isolates Pythonization. Consider closing the sender explicitly after the loop to unblock consumers sooner (minor).

```diff
         tokio::spawn(async move {
             let mut stream = stream;
             while let Some(response) = stream.next().await {
 @@
             }
+            // Explicitly drop sender to signal completion
+            drop(tx);
         });
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- benchmarks/router/ping.sh (1 hunks)
- benchmarks/router/prefix_ratio_benchmark.py (1 hunks)
- benchmarks/router/real_data_benchmark.py (1 hunks)
- benchmarks/router/run_engines.sh (5 hunks)
- components/backends/vllm/launch/agg_router.sh (1 hunks)
- components/backends/vllm/launch/disagg_router.sh (1 hunks)
- components/backends/vllm/src/dynamo/vllm/handlers.py (3 hunks)
- components/backends/vllm/src/dynamo/vllm/main.py (6 hunks)
- components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py (1 hunks)
- components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1 hunks)
- lib/bindings/python/rust/llm/kv.rs (5 hunks)
- lib/bindings/python/src/dynamo/_core.pyi (0 hunks)
- lib/bindings/python/src/dynamo/llm/__init__.py (1 hunks)
- lib/llm/src/kv_router.rs (7 hunks)
- lib/llm/src/kv_router/scheduler.rs (4 hunks)
💤 Files with no reviewable changes (1)
- lib/bindings/python/src/dynamo/_core.pyi
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.392Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.313Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/bindings/python/rust/llm/kv.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/llm/__init__.py (2)
- lib/bindings/python/rust/lib.rs (1): _core (69-130)
- lib/bindings/python/src/dynamo/_core.pyi (1): KvPushRouter (1157-1256)

components/backends/vllm/src/dynamo/vllm/handlers.py (4)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): configure_dynamo_logging (77-107)
- lib/bindings/python/src/dynamo/_core.pyi (5): instance_ids (256-263), generate (1178-1210), get (137-150), direct (286-290), round_robin (280-284)
- lib/bindings/python/rust/lib.rs (5): instance_ids (904-906), generate (923-935), get (608-624), direct (1005-1043), round_robin (939-968)
- lib/bindings/python/rust/llm/kv.rs (2): anext (1110-1122), generate (946-1013)

components/backends/vllm/src/dynamo/vllm/main.py (4)
- components/backends/vllm/src/dynamo/vllm/args.py (1): Config (39-64)
- examples/multimodal/utils/args.py (1): Config (25-40)
- lib/bindings/python/src/dynamo/_core.pyi (9): component (191-195), ZmqKvEventPublisher (811-825), endpoint (210-214), ZmqKvEventPublisherConfig (793-809), lease_id (243-247), block_size (625-629), block_size (648-652), namespace (38-42), client (237-241)
- examples/multimodal/components/worker.py (1): setup_vllm_engine (117-174)

components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (4)
- lib/bindings/python/src/dynamo/_core.pyi (12): KvPushRouter (1157-1256), KvRouterConfig (867-869), Client (249-290), DistributedRuntime (31-62), block_size (625-629), block_size (648-652), component (191-195), endpoint (210-214), client (237-241), best_worker_id (1212-1229), instance_ids (256-263), get (137-150)
- lib/bindings/python/src/dynamo/runtime/__init__.py (1): dynamo_worker (36-62)
- lib/bindings/python/src/dynamo/runtime/logging.py (1): configure_dynamo_logging (77-107)
- lib/bindings/python/rust/llm/kv.rs (3): block_size (436-438), block_size (497-499), best_worker_id (1031-1058)

lib/llm/src/kv_router.rs (6)
- lib/bindings/python/rust/llm/kv.rs (1): drop (228-230)
- lib/llm/src/kv_router/sequence.rs (1): drop (854-864)
- lib/llm/src/kv_router/approx.rs (1): drop (404-406)
- lib/llm/src/block_manager.rs (1): drop (86-88)
- lib/llm/src/kv_router/publisher.rs (1): drop (175-177)
- lib/llm/src/recorder.rs (1): drop (377-379)

lib/bindings/python/rust/llm/kv.rs (4)
- lib/bindings/python/src/dynamo/_core.pyi (3): KvPushRouter (1157-1256), new (117-135), component (191-195)
- components/backends/sglang/src/dynamo/sglang/protocol.py (1): PreprocessedRequest (36-43)
- lib/llm/src/kv_router.rs (3): new (137-161), new (224-321), new (482-487)
- lib/runtime/src/component.rs (2): component (428-430), component (575-581)
🪛 Ruff (0.13.1)
components/backends/vllm/src/dynamo/vllm_prefill_router/__init__.py
6-6: Do not catch blind exception: Exception
(BLE001)
11-11: Do not catch blind exception: Exception
(BLE001)
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py
69-69: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
72-72: Unused method argument: context
(ARG002)
110-110: Abstract raise to an inner function
(TRY301)
110-110: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Do not catch blind exception: Exception
(BLE001)
128-128: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
198-198: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3155/merge) by PeaBrane.
components/backends/vllm/src/dynamo/vllm/handlers.py
[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/handlers.py
components/backends/vllm/src/dynamo/vllm/main.py
[error] 1-1: isort: files were modified by this hook. Fixing /home/runner/work/dynamo/dynamo/components/backends/vllm/src/dynamo/vllm/main.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
🔇 Additional comments (25)
benchmarks/router/real_data_benchmark.py (1)
121-121: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/prefix_ratio_benchmark.py (1)
312-312: Port change to align with benchmark defaults.

The default URL port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/ping.sh (1)
6-7: Port change to align with benchmark defaults.

The default port has been updated from 8080 to 8000, consistent with the broader benchmark tooling changes across the router benchmarks.
benchmarks/router/run_engines.sh (9)
11-12: New configuration variables for prefill router support.

The addition of `USE_PREFILLS` and `BASE_GPU_OFFSET` variables provides the foundation for configuring prefill workers and GPU allocation offsets.
34-41: CLI argument parsing for prefill router features.

The new command line options `--prefills` and `--base-gpu-offset` enable users to configure prefill worker mode and the GPU allocation offset, respectively.
84-87: Validation for BASE_GPU_OFFSET parameter.

The validation ensures `BASE_GPU_OFFSET` is a non-negative integer, preventing invalid GPU allocation configurations.
91-91: GPU range calculation with offset support.

The `LAST_GPU` calculation now accounts for the `BASE_GPU_OFFSET`, enabling proper GPU allocation when using GPU offsets.
94-94: Enhanced configuration display for prefill workers.

The configuration output now clearly indicates worker type (Prefill vs Decode) and shows the GPU range, improving operational visibility.
Also applies to: 99-99
114-115: Worker type labeling for improved logging.

The `WORKER_TYPE` variable enables consistent labeling throughout the script, making it easier to distinguish between prefill and decode workers in logs.
119-119: Consistent worker type labeling in status messages.

The status messages now clearly indicate whether workers are prefill or decode type, improving operational clarity.
Also applies to: 146-146, 161-161
122-122: GPU allocation with base offset support.

The `START_GPU` calculation now incorporates `BASE_GPU_OFFSET`, enabling proper GPU allocation when using GPU offsets for worker deployment.
148-157: vLLM argument construction with prefill worker support.

The refactored argument building approach uses an array to construct vLLM arguments, conditionally adding `--is-prefill-worker` when prefill mode is enabled. This provides a clean and maintainable way to handle the conditional argument passing.

components/backends/vllm/launch/agg_router.sh (1)
7-8: Good addition of deterministic hashing for KV event IDs.

Setting `PYTHONHASHSEED=0` ensures consistent hash values across runs, which is important for KV event ID generation and debugging.

lib/bindings/python/src/dynamo/llm/__init__.py (1)
28-28: LGTM! KvPushRouter export aligns with updated bindings.

The addition of `KvPushRouter` to the public API is consistent with the broader prefill router changes in the PR.

components/backends/vllm/src/dynamo/vllm/main.py (1)
91-123: Well-structured KV event publisher setup.

The new `setup_kv_event_publisher` helper function properly encapsulates the KV publisher initialization logic with good error handling and configuration.

components/backends/vllm/launch/disagg_router.sh (1)
38-49: Syntax error: Missing backslash for line continuation.

Line 48 is missing a backslash for proper line continuation in the bash script.

Apply this diff to fix the syntax error:

```diff
 CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.vllm \
   --model $MODEL \
   --block-size $BLOCK_SIZE \
-  --enforce-eager \
+  --enforce-eager \
   --is-prefill-worker
```

Likely an incorrect or invalid review comment.
lib/llm/src/kv_router.rs (3)
616-621: Graceful shutdown via cancellation token — LGTM.

Matches existing patterns elsewhere and prevents background task leaks.
354-363: Propagating `context_id` as `Option<String>` to the scheduler — LGTM.

Keeps state updates conditional and aligns with the new `SchedulingRequest` semantics.
323-333: find_best_match signature change — original concern is incorrect.

Internal call sites were updated to the new signature (lib/llm/src/kv_router.rs, calls around lines 454 and 558). The public wrapper still exposes the old two-arg API and delegates with None (lib/llm/src/kv_router.rs:490-497), so the Python binding call at lib/bindings/python/rust/llm/kv.rs:1049-1052 (`.find_best_match(&token_ids, router_config_override.as_ref())`) matches the public API and does not need changes.
Likely an incorrect or invalid review comment.
lib/bindings/python/rust/llm/kv.rs (3)
915-921: Good preflight: prevent KV routing in static mode.

LGTM; avoids subtle runtime errors.
1011-1013: Reuse helper for generate path — LGTM.

Keeps binding thin and consistent.
1030-1057: API alignment verified — no action required.

lib/bindings/python/src/dynamo/_core.pyi defines `async def best_worker_id(self, token_ids: List[int], router_config_override: Optional[JsonLike] = None) -> Tuple[int, int]`, and the pyo3 binding in lib/bindings/python/rust/llm/kv.rs has `#[pyo3(signature = (token_ids, router_config_override=None))]`; the caller components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py invokes `best_worker_id(token_ids=token_ids)` (no override), which is compatible.
lib/llm/src/kv_router/scheduler.rs (2)
251-257: Fail fast when update_states=true and request_id is missing.

Silently skipping state updates biases scheduling — return an error to callers (fail fast) or at minimum log at warn and include the update_states flag in the message. Would you prefer returning Err(KvSchedulerError::AllWorkersBusy) or adding a new KvSchedulerError variant to surface this to callers?
58-71: Resolved — callers verified; no changes required.

Only schedule() is called at lib/llm/src/kv_router.rs:355. generate() calls find_best_match with Some(context_id) and update_states=true; KvPushRouter calls with update_states=false. The scheduler's requirement that update_states=true implies Some(request_id) is satisfied.
components/backends/vllm/src/dynamo/vllm_prefill_router/__main__.py (1)
52-62: Confirmed: KvRouterConfig kwargs match binding initializer.

`router_track_active_blocks` and `router_reset_states` are exposed in the Python binding (pyo3 signature in lib/bindings/python/rust/llm/entrypoint.rs) and match the Rust KvRouterConfig fields, so the KvRouterConfig(...) call is valid.
Signed-off-by: PeaBrane <yanrpei@gmail.com> Signed-off-by: Jason Zhou <jasonzho@nvidia.com>
Signed-off-by: PeaBrane <yanrpei@gmail.com> Signed-off-by: Kyle H <kylhuang@nvidia.com>
Overview:
A vLLM prefill router can be launched with `python -m dynamo.vllm_prefill_router`. Eventually, we need to make this engine agnostic; not dealing with that for now, to keep the PR somewhat small. Closes #1895

Core changes
We first exposed `KvRouter` directly via Python bindings, particularly the `find_best_match` and `free` methods. These should not be used by typical users, so we are not adding docs for them.

Added the `vllm_prefill_router` service, with the flow being:

- `find_best_match` endpoint to get the best prefill worker id via the router client
- `free` endpoint to free the request tracking from the prefill router

Note, this is actually not the ideal flow; it is cleaner to use the `KvPushRouter` directly, so the 3 steps are handled under the hood automatically. However, that would currently require intrusive changes to `PreprocessedRequest` to support remote-prefill-specific extra args, so I am deferring it to a future PR where we would have a unified field for these args, with all 3 frameworks using it consistently. (See #3159)

Benchmarking
Done with 4P4D, on A100, 8b model, concurrency of 20
Done with 5P3D, on A100, 8b model, concurrency of 20
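As a reference for the 3-step flow described under Core changes (query the router for the best prefill worker, dispatch directly to it, then free the router-side tracking), here is a minimal sketch; the client objects and method signatures are illustrative stand-ins, not the real bindings:

```python
async def route_prefill(router_client, prefill_client, request):
    """Sketch of the prefill-router service's 3-step flow."""
    # 1. Ask the KV router for the best prefill worker for these tokens
    worker_id, _overlap_blocks = await router_client.find_best_match(
        request["token_ids"]
    )
    try:
        # 2. Dispatch the request directly to that worker
        return await prefill_client.direct(request, worker_id)
    finally:
        # 3. Free the router-side request tracking, even if dispatch failed
        await router_client.free(request["request_id"])
```

Putting the `free` call in a `finally` block reflects why handling these steps inside `KvPushRouter` is cleaner: the tracking cleanup is easy to forget when the three calls are spread across a service boundary.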