Closed
Labels
- backend::vllm: Relates to the vllm backend
- backlog: Issues or bugs that will be tracked for long term fixes
- bug: Something isn't working
Description
If the KV router is restarted, it no longer knows the full KV block state of the workers. It cannot find the parent blocks of newly reported blocks, so KV routing stops working.
We need a way for the KV router to ask the engine to resend the state of all its KV blocks.
Progress:
- Allow the KvIndexer to dump the RadixTree snapshot (feat: dump radix tree as router events #2057)
- Make etcd aware of Router replicas (feat: register Kv router instance into etcd #2548)
- Allow direct upload and download of Rust structs via the NATS object store (feat: upload/download rust structs directly through NATs object store #2540)
- Allow the NATS queue to handle multiple consumers and purge up to the ACK floor (feat: allow specifying consumer name for NATS queue + manually purge old messages #2740)
- Hook everything up. KV events should be published over JetStream (the NATS queue). Periodically, one router replica is selected to save a RadixTree snapshot and purge the stream up to that snapshot. When a Router is brought up, it first loads the radix state, then reads the stream from the purged watermark. (feat: Router warm restarts via durable KV event consumers and radix snapshotting #2756)
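
The snapshot-then-replay flow in the last item can be sketched in a few lines. This is only an in-memory illustration under stated assumptions, not the project's implementation: `EventLog` is a hypothetical stand-in for the durable JetStream stream, `RouterIndex` is a flat dictionary rather than a real RadixTree, and every name below is invented for the example.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class KvEvent:
    seq: int                 # position in the stream
    worker: str
    kind: str                # "stored" or "removed"
    block: int               # block hash
    parent: Optional[int] = None


class EventLog:
    """Hypothetical stand-in for a durable, purgeable event stream."""

    def __init__(self):
        self.events = []
        self.next_seq = 1

    def publish(self, worker, kind, block, parent=None):
        ev = KvEvent(self.next_seq, worker, kind, block, parent)
        self.events.append(ev)
        self.next_seq += 1
        return ev

    def purge_up_to(self, seq):
        # Drop everything already covered by a snapshot.
        self.events = [e for e in self.events if e.seq > seq]

    def read_after(self, seq):
        return [e for e in self.events if e.seq > seq]


class RouterIndex:
    """Toy block index: block -> (parent, set of workers holding it)."""

    def __init__(self):
        self.blocks = {}
        self.last_seq = 0    # watermark: last event applied

    def apply(self, ev):
        if ev.kind == "stored":
            # A router without the parent cannot place the new block.
            if ev.parent is not None and ev.parent not in self.blocks:
                raise KeyError(f"unknown parent block {ev.parent}")
            _, workers = self.blocks.setdefault(ev.block, (ev.parent, set()))
            workers.add(ev.worker)
        elif ev.kind == "removed":
            entry = self.blocks.get(ev.block)
            if entry is not None:
                entry[1].discard(ev.worker)
                if not entry[1]:
                    del self.blocks[ev.block]
        self.last_seq = ev.seq

    def snapshot(self):
        return {"blocks": copy.deepcopy(self.blocks), "last_seq": self.last_seq}

    @classmethod
    def restore(cls, snap, log):
        # Load the saved state, then replay only events past the watermark.
        idx = cls()
        idx.blocks = copy.deepcopy(snap["blocks"])
        idx.last_seq = snap["last_seq"]
        for ev in log.read_after(idx.last_seq):
            idx.apply(ev)
        return idx


def demo():
    log, live = EventLog(), RouterIndex()
    live.apply(log.publish("w0", "stored", 1))
    live.apply(log.publish("w0", "stored", 2, parent=1))
    snap = live.snapshot()
    log.purge_up_to(snap["last_seq"])          # stream now starts after the snapshot
    live.apply(log.publish("w1", "stored", 3, parent=2))
    warm = RouterIndex.restore(snap, log)      # restarted router
    return live, warm
```

A router restored this way ends up with the same index as the one that never restarted, while a router that reads only the purged stream without the snapshot fails on the first block whose parent it has never seen, which is the failure described at the top of this issue.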