feat: add RuntimeConfig to ModelEntry #2311
👋 Hi jorgeantonio21! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough
This change introduces a new
Changes
Sequence Diagram(s)
sequenceDiagram
    participant Engine
    participant Worker
    participant Endpoint
    participant Etcd
    Worker->>Engine: Query runtime parameters
    Worker->>Worker: Build ModelRuntimeConfig
    Worker->>Endpoint: register_runtime_config(model_name, config)
    Endpoint->>Etcd: Update ModelDeploymentCard with runtime config
    Etcd-->>Endpoint: Ack
    Endpoint-->>Worker: Ack
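The flow in the diagram can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `FakeEngine`, `FakeEndpoint`, `FakeKeyValueStore`, `runtime_parameters()`, and the `models/<model_name>` key layout are all hypothetical stand-ins, and an in-memory dict takes the place of etcd. Only the `register_runtime_config(model_name, config)` call shape is taken from the diagram.

```python
import json


class FakeKeyValueStore:
    """In-memory stand-in for etcd (hypothetical, not a real etcd client)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        return True  # plays the role of the "Ack" in the diagram


class FakeEndpoint:
    """Hypothetical endpoint that stores the runtime config on the model's card."""

    def __init__(self, store):
        self.store = store

    def register_runtime_config(self, model_name, config):
        key = f"models/{model_name}"  # assumed key layout, for illustration only
        card = json.loads(self.store.data.get(key, "{}"))
        card["runtime_config"] = config  # update the card, keep its other fields
        return self.store.put(key, json.dumps(card))


class FakeEngine:
    """Hypothetical engine exposing values computed during initialization."""

    def runtime_parameters(self):
        return {"total_kv_blocks": 1024, "max_num_seqs": 256}


def worker_startup(engine, endpoint, model_name):
    params = engine.runtime_parameters()  # Worker ->> Engine: query parameters
    config = dict(params)                 # Worker ->> Worker: build the config
    # Worker ->> Endpoint: register; the returned value is the Ack
    return endpoint.register_runtime_config(model_name, config)


store = FakeKeyValueStore()
acked = worker_startup(FakeEngine(), FakeEndpoint(store), "demo-model")
```

The registration happens once at worker startup, which matches the description's point that these values are computed during engine initialization and then stay constant.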
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Co-authored-by: Yan Ru Pei <yanrpei@gmail.com>
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Overview:
This PR implements runtime configuration registration for vLLM and SGLang backends, adding a middle layer between static configs and dynamic metrics. Runtime configs are computed during engine initialization but remain constant during the session, providing accurate resource information for better routing and resource management.
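As a rough illustration of "computed at initialization, constant for the session", the runtime config could be modeled as an immutable record. The sketch below is in Python for brevity and is not the Rust structure added by this PR. The three field names come from the Details section, while the `Optional` types, the defaults, and the `to_json()` helper are assumptions made for the example.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: values cannot be reassigned after construction
class ModelRuntimeConfig:
    # Field names follow the PR description; types and defaults are assumed.
    total_kv_blocks: Optional[int] = None
    max_num_seqs: Optional[int] = None
    gpu_memory_utilization: Optional[float] = None

    def to_json(self) -> str:
        """Hypothetical helper: serialize the config for registration."""
        return json.dumps(asdict(self))


config = ModelRuntimeConfig(
    total_kv_blocks=1024,
    max_num_seqs=256,
    gpu_memory_utilization=0.9,
)

try:
    config.max_num_seqs = 1  # rejected because the dataclass is frozen
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Making the record immutable is one way to separate it from dynamic metrics, which change per request, while still letting it differ from the static config supplied before the engine starts.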
Details:
New Features:
- `ModelRuntimeConfig` structure to store runtime-computed values such as `total_kv_blocks`, `max_num_seqs`, and `gpu_memory_utilization`.
- `ModelRuntimeConfig` is added as a new field to `ModelEntry`, so it is registered to etcd on `LocalModel.attach()`, a shared routine.
- `KvRouter` gets a new prefix watcher on `models/` that listens to the runtime configs of all workers in a push-based manner, in order to update its knowledge of each worker's total load capacity. (Note that this is not yet used for request rejection; that will be implemented in a future PR.)

vLLM Implementation:
- `engine_client.engine.cache_config.num_gpu_blocks`
- `engine_client.vllm_config.scheduler_config.max_num_seqs`
- `engine_client.engine.cache_config.gpu_memory_utilization`

SGLang Implementation:
- `engine.get_server_info()` to get actual computed values from the SGLang engine

Mocker Implementation:
- `total_kv_blocks` and `max_num_seqs` are passed in as `extra_engine_args` for the mocker engine, so on `LocalModelBuilder.build()` they override the runtime config

Where should the reviewer start?
Core Implementation:
- `lib/llm/src/local_model/runtime_config.rs` - New runtime config structure
- `lib/llm/src/local_model.rs` - register the runtime config as part of the `ModelEntry` on `attach()`
- `lib/llm/src/kv_router/scheduler.rs` - handles both instance and model watchers

vLLM Integration:
- `components/backends/vllm/src/dynamo/vllm/main.py`

SGLang Integration:
- `components/backends/sglang/src/dynamo/sglang/worker/main.py`

Related Issues: