
Conversation

@fake0fan
Contributor

@fake0fan fake0fan commented Jul 28, 2025

MLLM Encode separation and E-P Encoder Cache Transfer

Encode-Prefill-Decode disaggregation provides greater flexibility in distributed MLLM inference, enables better resource utilization under fixed TTFT and TPOT Service Level Objectives, and allows for the application of stage-level optimizations. To implement full EPD disaggregation, we need to move the visual encoder to a separate instance.

This update introduces the implementation of MLLM visual encoder separation, an abstraction for inter-instance (E-P) communication for encoder cache transfer, and a concrete example implementation of disaggregated E+PD serving.

EPD disaggregation

Motivation

  1. Encoder separation is a critical part of EPD disaggregation, as it allows the visual encoder to be decoupled from the Prefill and Decode stages. Implementing and using EPD disaggregation in the future therefore requires visual encoder separation.

  2. Better serving for MLLMs. Consider a continuous serving scenario with mixed inputs, where every 10th request includes a large multimodal input while all others are text-only. In the current vLLM, all requests in a batch wait during model execution for the multimodal input embedding generation to complete. When a new request with an image arrives, all requests again wait for the multimodal (MM) encoder to run, causing significant performance degradation. By separating the encoding stage, we can avoid this bottleneck. This scenario is used only to highlight the issue; the same bottleneck appears in other serving scenarios and can likewise be removed by a separated encoder.

flowchart LR
    subgraph "Current vLLM Architecture"
        direction LR
        

        subgraph I2 ["Iteration 2"]
            direction LR
            R2["Requests 11-19: Text<br/>🖼️ Request 20: MM + Text"]
            R3["Running Batch 1-10"]
            S2[Scheduler]
            B2["Batch 1-20"]
            MR2[ModelRunner]
            MM2["execute_mm_encoder<br/>ALL requests blocked AGAIN"]
            LM2["Language Model<br/>1 generation step"]
            
            R2 --> S2
            R3 <--> S2
            S2 --> B2 --> MR2 --> MM2 --> LM2
        end

        subgraph I1 ["Iteration 1"]
            direction LR
            R1["Requests 1-9: Text<br/> Request 10: MM + Text"] 
            S1[Scheduler]
            B1["Batch 1-10"]
            MR1[ ModelRunner]
            MM1["execute_mm_encoder<br/> ALL requests blocked"]
            LM1["Language Model<br/>1 generation step"]
            
            R1 --> S1 --> B1 --> MR1 --> MM1 --> LM1
        end

    end

Overall Process

The overall process of separated encoding in a 1E1PD-with-proxy scenario:

  1. The client sends an HTTP request to the Proxy/Router's /v1/completions interface.
  2. The Proxy/Router selects a 1E1PD pair (1 Encode instance + 1 Prefill+Decode instance) and generates a request_id.
  3. The Proxy sends the request to the E instance.
  4. When the request is scheduled on the E instance, the request's metadata is sent to the PD instance through the Encoder Cache connector.
  5. The multimodal embeddings are then generated for the request's multimodal inputs and scheduled for transfer in the encoder cache connector on the E instance; an empty response is returned to the proxy immediately after the generation of the multimodal embeddings is completed.
  6. After receiving the response from the E instance, the Proxy/Router forwards the request to the PD instance. As the encoder budget is set to 0 on the PD instance, it cannot schedule chunks with multimodal input and waits for the injection of the encoder cache.
  7. When there is enough space in the encoder cache manager on the PD instance, the instance preallocates space for the request and sends a preallocation notification to the E instance. If the encoder cache is not required (i.e., the tokens are reused via the KV cache), it sends a "failed" preallocation notification, indicating that the encoder cache transfer is not needed.
  8. After the preallocation notification is received by the E instance's EC connector, the encoder cache is sent from the E instance to the PD instance.
  9. The PD instance receives the cache, finalizes the allocation of the multimodal input data, and is now able to schedule the request. Since the encoder cache is already computed, the PD instance skips the MLLM's encode step and uses the value from the cache.
  10. After completing Prefill and Decode, the PD instance returns the result to the Proxy/Router, which then forwards it to the client.
sequenceDiagram
    participant P as Proxy
    participant ES as Encode Scheduler
    participant EVW as DisaggVModelGPURunnerWrapper
    participant Redis as Redis
    participant PDW as DisaggLModelGPURunnerWrapper
    participant PDS as Prefill+Decode Scheduler

    P->>ES: Request
    ES->>ES: Schedule request
    ES->>Redis: Send encoder cache metadata
    ES->>EVW: SchedulerOutput
    EVW->>EVW: _execute_mm_encoder() - Process vision/MM inputs
    Redis->>PDS: Receive metadata
    EVW->>EVW: Store in encoder_cache temporarily
    EVW->>ES: ModelRunnerOutput
    ES->>P: Empty response
    P->>PDS: Request
    alt Multimodal input tokens KV values are not obtained from KV cache
        PDS->>PDS: Preallocate encoder cache space
    else 
    end
    PDS->>Redis: Send preallocation response
    Redis->>EVW: Receive preallocation
    alt Multimodal input tokens KV values are not obtained from KV cache
    	EVW->>Redis: Send actual encoder cache data
        Redis->>PDW: Receive encoder cache
    	PDW->>PDW: Inject into encoder_cache dict
    else Multimodal input tokens KV values are obtained from KV cache
    end
    EVW->>ES: ModelRunnerOutput
    ES->>ES: Free slots in Encoder Cache Manager
    PDW->>PDS:ModelRunnerOutput
    alt Multimodal input tokens KV values are not obtained from KV cache
    	PDS->>PDS: Finalize allocation
    else Multimodal input tokens KV values are obtained from KV cache
       PDS->>PDS: Free preallocated encoder cache space
    end
    PDS->>PDW: SchedulerOutput
    Note over EVW,PDW: Phase 3: Normal LM Inference
    PDW->>PDW: Execute prefill + decode
    PDW->>PDW: Generate text output
    PDW->>PDS: ModelRunnerOutput
    PDS->>P: Response


Implementation

The implementation avoids large changes. Modifications to the model runner are minimal, and the main functionality is provided through model runner wrapper classes, which are only instantiated when running with the separated encoder. The same applies to the scheduler: all of its changes are guarded by the condition if self.separated_encode, ensuring that the default vLLM serving path remains unaffected. These changes only take effect during execution with the separated multimodal encoder.

vLLM minor changes

EPD Disaggregation Configuration

Files: vllm/config.py, vllm/core/arg_utils.py

Added a new configuration class for EPD disaggregation. It currently supports configuring the instance type, the instance's EPD rank, and the number of connector workers.
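
A minimal sketch of what such a configuration class could look like (the field names below are illustrative, not necessarily the ones used in this PR):

from dataclasses import dataclass
from typing import Literal

@dataclass
class EPDDisaggConfig:
    """Hypothetical sketch of an EPD disaggregation config; names are illustrative."""
    # Role of this instance in the disaggregated deployment.
    instance_type: Literal["encode", "prefill", "prefill+decode"] = "prefill+decode"
    # Rank of this instance among all EPD instances.
    epd_rank: int = 0
    # Number of worker threads used by the encoder cache connector.
    num_connector_workers: int = 1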

Additional ModelRunnerOutput Data Fields

Files: vllm/v1/outputs.py

The model runner output now includes two additional data fields: transfered_mm_data and injected_mm_data.

The transfered_mm_data field passes the list of transferred encoder cache input IDs from the model runner to the scheduler on the encode instance. After receiving the transferred data IDs, the scheduler frees the corresponding space in the encoder cache manager.

The injected_mm_data field passes the list of injected encoder cache input IDs from the model runner to the scheduler on the prefill instance. After receiving the injected data IDs, the scheduler finalizes the corresponding allocations in the encoder cache manager (or rolls them back if the data is no longer needed).
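
Conceptually, the two fields are lists of (request_id, input_id) pairs attached to the model runner output. A simplified sketch, assuming pair-based IDs (the real ModelRunnerOutput carries many more fields):

from dataclasses import dataclass, field

@dataclass
class EPDModelRunnerOutputFields:
    """Illustrative subset: only the two fields added for EPD."""
    # (req_id, input_id) pairs whose encoder cache transfer completed
    # (populated on the encode instance).
    transfered_mm_data: list[tuple[str, int]] = field(default_factory=list)
    # (req_id, input_id) pairs whose encoder cache was injected
    # (populated on the prefill / prefill+decode instance).
    injected_mm_data: list[tuple[str, int]] = field(default_factory=list)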

Model Runner Wrapper Integration in GPUWorker

Files: vllm/v1/worker/gpu_worker.py

When EPD disaggregation is enabled, the system uses wrapper classes built around the GPUModelRunner class.

GPU Model Runner sanity check in encoder execution

Files: vllm/v1/worker/gpu_model_runner.py

If EPD disaggregated serving is enabled, an additional attribute indicates whether encoder execution is allowed on the instance. This attribute is used as a sanity check on each execution of the encoder.
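
For illustration, the check can be as simple as an assertion guarding the encoder path; the attribute name below is an assumption:

class GPUModelRunnerSketch:
    """Illustrative only: shows where the EPD sanity check would sit."""

    def __init__(self, is_encode_instance: bool) -> None:
        # Attribute added only when EPD disaggregation is enabled
        # (the real attribute name may differ).
        self.allow_mm_encoder_execution = is_encode_instance

    def _execute_mm_encoder(self) -> None:
        # Sanity check: P/PD instances set their encoder budget to 0
        # and must never reach this path.
        assert self.allow_mm_encoder_execution, (
            "Encoder execution is not allowed on this instance")
        # ... run the vision encoder ...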

Major Changes

EncoderCacheManager new Allocation Logic

Files: vllm/v1/core/encoder_cache_manager.py

The EncoderCacheManager now contains methods for a two-step allocation process for remote encoder cache injection. This change introduces 4 new methods designed around the concept of preallocation, to reserve cache space for a request without immediately adding it to the cached dictionary.

The allocation process operates in two steps. In preallocation, the system reserves the required space by deducting it from the available encoder cache size. In allocation finalization, the request completes its allocation in the update_from_output scheduler function after successful encoder cache injection in the model runner.

To implement this approach, the following methods are added:

can_preallocate(cache_size) - checks if sufficient space exists for preallocation using only the request's encoder cache size data.

preallocate(req_id, input_id, cache_size) - reserves cache_size slots in the encoder cache manager and records that (req_id, input_id) is preallocated.

depreallocate(req_id, input_id) - rolls back the preallocate action by freeing the reserved slots associated with (req_id, input_id). Used when the multimodal input is skipped thanks to prefix cache usage.

finalize_allocation(req_id, input_id) - finalizes the allocation process by adding the (req_id, input_id) to the cached mapping, called after successful encoder cache injection to complete the allocation that was started with preallocate().

The two-step allocation ensures that sufficient space will exist in the encoder cache manager for incoming requests after the preallocation notification is sent. It also prevents direct allocation in the encoder cache manager until the actual cache injection occurs.

Since the complete request is not transferred from the encode instance to the prefill instance, these methods make their decisions using only the request's metadata.

The manager also includes a deallocate(self, req_id: str, input_id, encoder_cache_size) method that the encode instance uses to release cache space in the encoder scheduler. This method is needed because the encode instance finishes a request before the encoder cache transfer completes.
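
A minimal sketch of the two-step allocation API described above, with simplified bookkeeping (the real manager also tracks mm_hash-based caching and per-request references):

class EncoderCacheManagerSketch:
    """Illustrative two-step (preallocate -> finalize) allocation."""

    def __init__(self, cache_size: int) -> None:
        self.num_free_slots = cache_size
        # (req_id, input_id) -> reserved size; not yet visible as cached.
        self.preallocated: dict[tuple[str, int], int] = {}
        # (req_id, input_id) -> size of injected, usable caches.
        self.cached: dict[tuple[str, int], int] = {}

    def can_preallocate(self, cache_size: int) -> bool:
        return cache_size <= self.num_free_slots

    def preallocate(self, req_id: str, input_id: int, cache_size: int) -> None:
        self.num_free_slots -= cache_size
        self.preallocated[(req_id, input_id)] = cache_size

    def depreallocate(self, req_id: str, input_id: int) -> None:
        # Roll back a reservation, e.g. when prefix caching covers the input.
        self.num_free_slots += self.preallocated.pop((req_id, input_id))

    def finalize_allocation(self, req_id: str, input_id: int) -> None:
        # Space stays reserved; the entry now becomes visible as cached.
        self.cached[(req_id, input_id)] = self.preallocated.pop((req_id, input_id))

    def deallocate(self, req_id: str, input_id: int, encoder_cache_size: int) -> None:
        # Encode-instance side: release space once the transfer completed.
        self.num_free_slots += encoder_cache_size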

EncoderCachePreallocator

Files: vllm/separated_encode/sched/encoder_cache_preallocator.py

The EncoderCachePreallocator system introduces classes for managing encoder cache preallocation across distributed instances. The implementation provides two distinct strategies for handling encoder cache metadata and coordinating allocation decisions between encoder and scheduler instances.

The implementation provides an abstract base class, EncoderCachePreallocatorTemplate, that defines the core interface for encoder cache preallocation management. The template declares abstract methods for request lifecycle management and creates an encoder cache connector object in its constructor to receive encoder cache metadata.

The preallocator asynchronously receives encoder cache metadata through the receive_encoder_cache_metadata() callback. The detailed behaviour is determined by the concrete implementation, but in general it always tries to schedule a preallocation for (req_id, input_id); if the (req_id, input_id) is schedulable, it is added to the preallocation queue that is later consumed by get_prealloc_candidate.

On each request addition and completion, the preallocator's corresponding method is called to handle initialization and cleanup.

As the scheduler processes tokens, it continuously updates the multimodal input completion status through the preallocator's update_mm_inputs_done() method. This method determines which multimodal inputs have been fully processed based on their position in the token sequence and the number of computed tokens, which lets us identify encoder caches that are no longer required.
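
The completion check itself only needs each input's position and length. A hedged sketch of the idea, assuming (offset, length) placeholder positions:

def count_mm_inputs_done(mm_positions: list[tuple[int, int]],
                         num_computed_tokens: int) -> int:
    """Illustrative helper: how many multimodal inputs are fully covered.

    mm_positions holds (offset, length) of each multimodal input in the
    request's token sequence; an input is "done" once every one of its
    placeholder tokens has been computed.
    """
    done = 0
    for offset, length in mm_positions:
        if offset + length <= num_computed_tokens:
            done += 1
        else:
            break  # positions are assumed to be in ascending order
    return done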

Two concrete implementations provide different allocation strategies:

AsyncEncoderCachePreallocator provides an asynchronous approach that immediately triggers preallocation callbacks upon receiving encoder cache metadata. This implementation maintains minimal state and always accepts encoder caches, which avoids additional request state tracking and a synchronous handshake.

SyncEncoderCachePreallocator implements a synchronous approach with state tracking. It tracks active requests, pending preallocation requests, waiting preallocation metadata, and ignored preallocation entries, in order to decide whether the instance needs to accept the encoder cache or can reject it and use data from the KV cache.

Both implementations track multimodal input progress through mm_inputs_done and mm_inputs_total counters, updating completion status as tokens are computed, in order to clear the encoder cache after injection and to avoid accepting encoder caches for mm input tokens that are covered by the prefix cache. The get_prealloc_candidate method provides an interface for retrieving the next preallocation candidate based on available cache space, with an option to immediately fetch the next candidate. If the instance cannot get a candidate (either no candidates or no free slots), it returns False.

The preallocation queue operates on a first-come, first-served basis, with candidates being processed based on available encoder cache space. The system tracks which preallocations are pending to avoid duplicate processing and to ensure proper cleanup when preallocations become unnecessary.

The preallocation system coordinates with encoder cache connectors to send preallocation notifications, enabling distributed coordination between instances that generate cache data and instances that manage allocation decisions.

sequenceDiagram
    participant E as Encode Instance
    participant PD as Prefill+Decode Instance

    E->>PD: "I have 2048 tokens of encoder cache to send"
    Note over PD: Check available memory
    PD->>PD: encoder_cache_manager.can_preallocate(2048 tokens)
    alt Multimodal input tokens KV values are not obtained from KV cache
        PD->>PD: encoder_cache_manager.preallocate(req_id, input_id, 2048 tokens)
        PD->>E: "I reserved 2048 tokens for you"
        Note over E: Now safe to send
        E->>PD: Send actual 2048 tokens data
        Note over PD: After successful injection
        PD->>PD: encoder_cache_manager.finalize_allocation(req_id, input_id)
    else Multimodal input tokens KV values are obtained from KV cache
        PD->>PD: encoder_cache_manager.depreallocate(req_id, input_id)
        PD->>E: "No need to send"
    end

Encoder Scheduler

Encoder Scheduler (encode)

Files: vllm/separated_encode/sched/encoder_scheduler.py

Separate EncoderScheduler class implementation is provided for encode instance scheduling, while prefill and prefill+decode instances continue to use the main Scheduler class.

The EncoderScheduler is a specialized scheduler for encode instances that focuses on multimodal input processing. It maintains an _allocated dictionary to track allocated encoder cache entries and their sizes; this dictionary allows the encode request to finish before all related transfers are completed.

The encode scheduler schedules all multimodal inputs of a request at once in the schedule() method. It checks whether there is sufficient encoder cache space and budget before allocating all inputs together. A request on the encode instance is considered finished when all its multimodal embeddings have been computed, so every request finishes one iteration after scheduling; the transfer is handled separately in the encoder cache connectors, and the space allocated for the encoder cache is deallocated only after successful transfers, not after the request finishes.

In the update_from_output() method, the scheduler goes through the transferred multimodal data IDs and deallocates the corresponding encoder cache entries.
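
In code, that step can be as small as the loop below (a sketch, assuming transfered_mm_data holds (req_id, input_id) pairs and _allocated maps them to their sizes):

def handle_transferred_mm_data(scheduler, model_runner_output) -> None:
    # Encode-instance side: free encoder cache space only after the
    # connector reports that the corresponding transfer completed.
    for req_id, input_id in model_runner_output.transfered_mm_data:
        size = scheduler._allocated.pop((req_id, input_id))
        scheduler.encoder_cache_manager.deallocate(req_id, input_id, size)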

Main Scheduler (prefill and prefill+decode instances)

Files: vllm/v1/core/sched/scheduler.py

For prefill and prefill+decode instances, the main scheduler is modified for multimodal encode separation: the instance's max_num_encoder_input_tokens value is set to 0 to prevent multimodal encoder execution.

If the current instance is a prefill (P) or prefill+decode (PD) instance, a preallocator object is instantiated in the scheduler to manage communication and preallocation, and max_num_encoder_input_tokens is set to 0 to avoid running the multimodal data encoder on the P or PD instance.

The main scheduler has two changes: integration of the encoder cache preallocator and _perform_preallocations() into the request lifecycle, and handling of injected data.

The integration of ec_preallocator is described in the corresponding part of this documentation. The _perform_preallocations() function connects the ec_preallocator, which decides which requests will be preallocated, and the encoder cache manager, which actually performs the preallocations. The function keeps performing preallocations while there are enough slots in the encoder cache manager and pending preallocation requests. By default, _perform_preallocations() is called twice: in update_after_schedule() after freeing some encoder inputs, and in update_from_output() after handling injected data.

Injected data handling operates on the injected_mm_data obtained from the ModelRunnerOutput: the scheduler goes through the injected data and decides whether the allocation is finalized or, if the obtained data is no longer needed, can be depreallocated.
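
Put together, the PD-side handling might look roughly like this (a sketch; is_still_needed stands in for the preallocator's actual bookkeeping and is not a real method name):

def handle_injected_mm_data(scheduler, injected_mm_data) -> None:
    for req_id, input_id in injected_mm_data:
        if scheduler.ec_preallocator.is_still_needed(req_id, input_id):  # hypothetical
            # Cache arrived and will be consumed: make it visible.
            scheduler.encoder_cache_manager.finalize_allocation(req_id, input_id)
        else:
            # Covered by the prefix cache or request already finished: roll back.
            scheduler.encoder_cache_manager.depreallocate(req_id, input_id)
    # Space may have been released above, so retry pending preallocations.
    scheduler._perform_preallocations()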

This implementation is designed to avoid redundant processing or injection, to ensure that the scheduler has enough slots when the encoder cache arrives, and to ensure the deletion of encoder caches that cannot be used due to prefix caching or early request abortion.

Instance-Specific Model Runner Wrappers

Files: vllm/separated_encode/worker/gpu_epd_lm_wrapper.py, vllm/separated_encode/worker/gpu_epd_vm_wrapper.py

The implementation introduces specialized GPU model runner wrappers for the disaggregated architecture, with distinct roles for multimodal encoding and text generation. These wrappers are built on top of GPUModelRunner for better compatibility with future changes in GPUModelRunner. As long as the v1 interface of the GPU model runner remains unchanged, the wrappers do not require updates: they simply call the original methods, instantiate the encoder cache connector, track information, and extend the model runner output with EPD-related information.

DisaggVModelGPURunnerWrapper (Encode Instance)

This wrapper runs on encode instances and processes multimodal inputs. It executes encoder models and sends the results to other instances through encoder cache connector.

The encode instance doesn't need a KV cache, since it only runs the vision part of the MLLM. The wrapper overrides initialize_kv_cache_tensors and initialize_kv_cache to return empty results, freeing up GPU memory for larger encoder cache storage.

During execution, the wrapper runs the encoder for the scheduled multimodal inputs, converts the outputs to numpy arrays, and hands the encoder cache to the encoder cache connector. Since no text generation happens here, it returns a mostly empty ModelRunnerOutput with additional transfer status information; this information is used by the encoder scheduler to free space in the encoder cache manager.
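
A condensed sketch of this flow; run_mm_encoder and empty_model_runner_output are hypothetical helper names, while add_encoder_cache and get_transferred_ids come from the connector described below:

def encode_instance_execute_model(wrapper, scheduler_output):
    """Illustrative flow of the encode-instance wrapper."""
    # 1. Run only the vision encoder for the scheduled multimodal inputs
    #    (assumed here to return {(req_id, input_id): gpu_tensor}).
    mm_embeds = wrapper.run_mm_encoder(scheduler_output)
    for (req_id, input_id), embed in mm_embeds.items():
        # 2. Convert to numpy on the host so the connector can serialize it.
        wrapper.ec_connector.add_encoder_cache(
            req_id, input_id, embed.detach().cpu().numpy())
    # 3. Return a mostly empty output, annotated with transfer status.
    output = wrapper.empty_model_runner_output()
    output.transfered_mm_data = wrapper.ec_connector.get_transferred_ids()
    return output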

DisaggLModelGPURunnerWrapper (Prefill/(Prefill+Decode) Instance)

This wrapper runs on prefill or prefill+decode instances, where the language model is executed. It receives encoder caches from encode instances and injects them into the normal inference pipeline.

The wrapper uses a callback function receive_encoder_cache to handle incoming encoder data. This callback converts numpy arrays back to GPU tensors with the correct device and dtype, then stores them in the standard encoder_cache dictionary.

During execute_model, the wrapper simply calls the underlying model runner's execute_model and also tracks which encoder caches were injected. It reports successful injections back to the scheduler through the model output, allowing the scheduler to finalize the allocations of preallocated inputs.
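
The injection callback is essentially a numpy-to-tensor conversion plus a dictionary insert. A hedged sketch that omits position-info handling:

import numpy as np
import torch

def receive_encoder_cache_sketch(runner, req_id: str, input_id: int,
                                 cache: np.ndarray) -> None:
    """Illustrative PD-side injection callback."""
    # Convert back to a GPU tensor with the runner's device and dtype
    # (runner.model_dtype is an assumed attribute name).
    tensor = torch.from_numpy(cache).to(device=runner.device,
                                        dtype=runner.model_dtype)
    # Store in the per-request encoder cache dictionary (assumed layout).
    runner.encoder_cache.setdefault(req_id, {})[input_id] = tensor
    # Remember the injection so it can be reported back to the scheduler.
    runner.injected_ids.append((req_id, input_id))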

Encoder Cache Connector

Files: vllm/separated_encode/ec_transfer/connector/template.py, vllm/separated_encode/ec_transfer/connector/redis.py

The Encoder Cache Connector provides an abstraction layer for transferring encoder caches between encode and prefill instances in disaggregated vLLM deployments. The abstract base class ECConnectorTemplate defines the communication logic.

The connector operates using a thread-based architecture with separate send and receive event loops. Communication is handled asynchronously through configurable worker pools. It maintains separate queues for send and receive operations, with each operation executed by dedicated worker threads.

The encoder connector operates in four distinct states depending on the instance type and the component it serves:

State for Encode Scheduler - Pure sender functionality that handles encoder cache metadata transfer. When a multimodal input is scheduled, metadata-sending tasks are added to the send queue for processing by the send event loop.

State for Prefill Scheduler - Receives encoder cache metadata from encode instances and manages preallocation through scheduler callbacks. The preallocation logic is described in the scheduler updates. After successful preallocation, it sends completion notifications back to the encode instances from which it received the metadata.

State for Encode Model Runner - Manages cache storage, transfer, and lifecycle. It maintains:

  • encoder_cache: Dictionary storing computed encoder caches
  • cache_to_send: Set of pending cache transfers awaiting preallocation confirmation
  • cache_to_avoid: Set of caches that failed preallocation
  • transfered_ids: List tracking successfully transferred cache IDs

When encoder output is generated, add_encoder_cache() either stores the cache locally or immediately schedules the transfer if a successful preallocation notification was already received. Upon receiving successful preallocation notifications via _maybe_send_encoder_cache(), the connector either sends the cache immediately or adds the request to the pending set. It can also receive a failed preallocation notification, which means the encoder cache does not need to be sent to that instance and the cache for this (req_id, input_id) can be deleted on the Encode instance.

State for Prefill Model Runner - Receive-only state that accepts encoder cache data and calls injection callbacks to add the cache into the model runner's encoder cache dictionary.

The communication flow follows this sequence:

  • Encode Scheduler sends metadata to Prefill Scheduler for cache preallocation
  • Prefill Scheduler attempts preallocation and sends success/failure notifications to Encode Model Runner
  • Upon successful preallocation, Encode Model Runner transfers the actual encoder cache data to Prefill Model Runner

The preallocation step is implemented to avoid OOM problems on the prefill instance.

The add_encoder_cache method also looks up pending send requests. If a cache transfer has already been requested (the entry exists in the cache_to_send set), it immediately schedules the send operation rather than storing the cache locally. This optimization minimizes latency by sending caches as soon as both prerequisites are met. Failed preallocations are tracked in cache_to_avoid to prevent unnecessary transfers.

Transfer completion tracking is built into the class. Through the connector's get_transferred_ids method, the model runner can determine which request data has already been transferred.
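
A condensed sketch of the encode-side connector state machine described above (thread safety and serialization omitted; the send function is injected for illustration):

class EncodeRunnerConnectorStateSketch:
    """Illustrative pairing of cache availability and preallocation results."""

    def __init__(self, send_fn) -> None:
        self._send = send_fn  # transport-specific send, injected for the sketch
        self.encoder_cache: dict[tuple[str, int], object] = {}
        self.cache_to_send: set[tuple[str, int]] = set()   # prealloc ok, data not ready
        self.cache_to_avoid: set[tuple[str, int]] = set()  # prealloc failed
        self.transfered_ids: list[tuple[str, int]] = []

    def add_encoder_cache(self, req_id: str, input_id: int, cache) -> None:
        key = (req_id, input_id)
        if key in self.cache_to_avoid:
            self.cache_to_avoid.discard(key)   # not needed, drop immediately
        elif key in self.cache_to_send:
            self.cache_to_send.discard(key)    # prealloc already confirmed: send now
            self._send(key, cache)
            self.transfered_ids.append(key)
        else:
            self.encoder_cache[key] = cache    # wait for the notification

    def _maybe_send_encoder_cache(self, req_id: str, input_id: int,
                                  success: bool) -> None:
        key = (req_id, input_id)
        if not success:
            # "Failed" preallocation: the PD instance will reuse its KV cache.
            if key in self.encoder_cache:
                del self.encoder_cache[key]
            else:
                self.cache_to_avoid.add(key)
            return
        if key in self.encoder_cache:
            self._send(key, self.encoder_cache.pop(key))
            self.transfered_ids.append(key)
        else:
            self.cache_to_send.add(key)

    def get_transferred_ids(self) -> list[tuple[str, int]]:
        ids, self.transfered_ids = self.transfered_ids, []
        return ids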

graph LR
    subgraph "Encode Instance"
        ES[Encode Scheduler]
        EMR[Encode Model Runner]
    end
    
    subgraph "Prefill Instance" 
        PS[Prefill Scheduler]
        PMR[Prefill Model Runner]
    end
    
    
    ES -.->|metadata| PS
    EMR -.->|cache data| PMR
    PS -.->|notification| EMR
    PMR -.->|injected IDs| PS
    EMR -.->|transfered IDs| ES

Extension Example

The included RedisECConnector demonstrates a concrete implementation using Redis as the communication backend. To use other communication backends, implement the abstract methods _send_prealloc_notification, _send_encoder_cache_metas, _send_encoder_cache, _recv_prealloc_notification, _recv_encoder_cache_metas, and _recv_encoder_cache according to your chosen transport mechanism. This connector extension supports multiple E instances and multiple PD or P instances.
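
As an illustration of what extending the template involves, a skeleton over the redis-py client might look like the following; the queue keys, payload format, and method signatures here are invented for the example and will differ from the PR's RedisECConnector:

import pickle
import redis  # pip install redis

class RedisECConnectorSketch:
    """Skeleton of the six transport methods over Redis lists."""

    def __init__(self, host: str = "localhost", port: int = 6379) -> None:
        self.r = redis.Redis(host=host, port=port)

    def _send_encoder_cache_metas(self, metas) -> None:
        self.r.rpush("ec:metas", pickle.dumps(metas))

    def _recv_encoder_cache_metas(self, callback) -> None:
        _, payload = self.r.blpop("ec:metas")   # blocks until metadata arrives
        callback(pickle.loads(payload))

    def _send_prealloc_notification(self, notification) -> None:
        self.r.rpush("ec:prealloc", pickle.dumps(notification))

    def _recv_prealloc_notification(self, callback) -> None:
        _, payload = self.r.blpop("ec:prealloc")
        callback(pickle.loads(payload))

    def _send_encoder_cache(self, req_id, input_id, cache) -> None:
        self.r.rpush("ec:data", pickle.dumps((req_id, input_id, cache)))

    def _recv_encoder_cache(self, callback) -> None:
        _, payload = self.r.blpop("ec:data")
        callback(*pickle.loads(payload))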

Usage Instructions

The system offers three different deployment options through separate scripts. The first two scripts, launch_epd_api_server.sh and launch_epd_serve.sh, are designed to run the 1E1PD disaggregation on a single device and are used in the benchmark below to illustrate the performance of the encoder separation. The third script, launch_epd_serve_separated.sh, enables 1E1PD disaggregation by deploying EPD across two devices.

To start the EPD instances and proxy server, select one of the provided scripts and modify the arguments as needed before execution. You can run the deployment using any of these commands:

bash examples/online_serving/separated_encode/launch_epd_serve.sh
bash examples/online_serving/separated_encode/launch_epd_serve_separated.sh
bash examples/online_serving/separated_encode/launch_epd_api_server.sh

After the server starts running, you can interact with it using OpenAI-compatible API requests to send queries and receive responses. Sample Python code for sending requests is available in the examples/online_serving/separated_encode/ directory, providing practical examples for integration.
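
For reference, a request against the OpenAI-compatible endpoint can be sent with the standard openai client; the port and model name below are taken from the benchmark setup and should be adjusted to your deployment:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:10001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # placeholder; use your served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)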

Regarding the distinction between launch_epd_api_server.sh and launch_epd_serve.sh: communication between the proxy and the vLLM API introduces significant delays, likely due to the proxy implementation. For benchmarking purposes, launch_epd_api_server.sh was preferred, as it uses a custom API server that instantiates AsyncLLMs directly, bypassing the proxy overhead. Both the proxy and the API server implementations can be found in their respective directories: examples/online_serving/separated_encode/api_server and examples/online_serving/separated_encode/proxy.

The API server implementation allows sending both requests simultaneously, without waiting for the response from the E instance, and enables a shared tokenization process.

Benchmark

Performance evaluation was conducted using Qwen2.5-VL-3B-Instruct on an NVIDIA A100-SXM4-80GB GPU, comparing the default vllm serve against the 1E1PD-with-API-server approach (a single GPU is used in both cases). Testing utilized the lmarena-ai/VisionArena-Chat dataset with prompt loads varying from 100 to 1000 requests to assess scalability characteristics.

python benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model $MODEL \
  --dataset-name hf \
  --dataset-path $DATASET \
  --hf-split train \
  --num-prompts $NUM_PROMPTS \
  --seed 40 \
  --save-result \
  --save-detailed \
  --result-dir $LOG_PATH/vision_arena_results \
  --result-filename vision_arena_outputs$(date +"%Y%m%d_%H%M%S").json \
  --port 10001 > $LOG_PATH/benchmark_VisionArena_$(date +"%Y%m%d_%H%M%S").log 2>&1

Each approach was benchmarked exactly three times for each of the 4 workloads. The detailed individual run results are provided at the end of this document. The following table presents the values averaged across the three runs:

| Approach | num_prompts | Benchmark duration (s) | Total input tokens | Total generated tokens | Mean TTFT (ms) | Mean ITL (ms) |
|---|---|---|---|---|---|---|
| EPD + API server | 1000 | 98.79 | 92971 | 106523.7 | 57026.25 | 104.96 |
| Default | 1000 | 104.58 | 92971 | 106839.7 | 52346.16 | 142.42 |
| EPD + API server | 500 | 51.42 | 60877 | 54107.67 | 28896.73 | 95.87 |
| Default | 500 | 54.70 | 60877 | 54266 | 25312.62 | 125.84 |
| EPD + API server | 200 | 22.37 | 32055 | 22005 | 11214.79 | 68.46 |
| Default | 200 | 23.90 | 32055 | 22076.33 | 10518.36 | 85.03 |
| EPD + API server | 100 | 11.58 | 8122 | 10900 | 6844.65 | 28.94 |
| Default | 100 | 10.93 | 8122 | 10955.33 | 5377.11 | 45.80 |

As mentioned in the motivation section, the results demonstrate significant improvements in TPOT; for example, median TPOT improves by at least 30% across all workloads. Due to the TPOT improvement, we also observe improvements in ITL, especially when the number of inputs is significantly larger than one batch. However, increased TTFT is also observed in some cases.

The elevated TTFT primarily stems from two factors. First, image preprocessing occurs twice in the current implementation, creating redundant computational overhead. While this duplication introduces latency, the impact is partially mitigated by sending the request to the encoder and prefill-decode instances concurrently. Second, the encoder cache transfer mechanism introduces delay in the current two-step preallocation approach, where a preallocation notification is sent and then one iteration of model execution on the PD instance's model runner is needed to receive the encoder cache and return the injected IDs.

In the future, image preprocessing on the PD or P instance can be removed in the separated-encode scenario. The ECConnector could also be extended to use an intermediate caching server that stores encoder outputs and enables immediate cache delivery, eliminating the synchronization wait without changing the ECConnector architecture. Additionally, adopting the AsyncScheduler feature could provide asynchronous coordination between the encoder and prefill-decode stages, reducing communication overhead.

Other factors may also contribute to the TTFT increase, but they can most likely be handled by proper configuration and minor changes, without architectural modifications.

We can also note general throughput improvements, with EPD achieving ~5% higher request throughput at the 1000-prompt workload (10.12 vs 9.58 req/s) and higher output token throughput (1078 vs 1023 tok/s). By resolving the TTFT-related issues and moving the vLLM architecture further towards disaggregated serving, we can continue to obtain better results in distributed settings; the current implementation already demonstrates the benefits while leaving clear optimization opportunities.

@mergify

mergify bot commented Jul 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fake0fan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural change for disaggregating the vision encoder from the main model execution, which is a great step towards improving performance for multimodal models. The implementation is comprehensive, touching on scheduling, model execution, and communication between new services. My review focuses on ensuring the robustness and correctness of the new components, particularly around error handling, concurrency, and resource management. I've identified a few critical issues related to hardcoded values and error handling that could impact stability and correctness in a production environment. Addressing these will make the new architecture more resilient.

Comment on lines 129 to 134
Contributor


critical

The _recv_encoder_cache method currently converts the received numpy array to a torch tensor and moves it to a hardcoded device cuda:0. This is not flexible and will fail in multi-GPU or non-standard device configurations. This logic should be moved to the caller, which is aware of the correct device. The connector should be responsible only for data transfer, not device placement.

            req_id, input_id, pos_info_dict, encoder_cache_numpy = \
                pickle.loads(pickled_data)
            pos_info = dict_to_pos_info(pos_info_dict)
            callback(req_id, input_id, pos_info, encoder_cache_numpy)

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot removed the needs-rebase label Jul 31, 2025
@fake0fan fake0fan changed the title [Core][WIP] Encoder separation for Encode-Prefill-Decode Disaggregation [Core] Encoder separation for Encode-Prefill-Decode Disaggregation Jul 31, 2025
@fake0fan fake0fan marked this pull request as draft July 31, 2025 02:48
@fake0fan fake0fan marked this pull request as ready for review July 31, 2025 02:58
@mergify

mergify bot commented Aug 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fake0fan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural change to disaggregate the MLLM encoder, which is a crucial step for flexible and efficient distributed inference. The implementation is comprehensive, with changes to the scheduler, model runner, and new components like connectors and wrappers. The documentation is thorough. My review has identified several high-severity issues that could affect the functionality and stability of the new feature. These include bugs in the example launch scripts, a fragile data parsing method in the Redis connector, a potential memory leak in the preallocator, and a concurrency issue in the scheduler's preallocation logic. Addressing these points will improve the robustness of this new architecture.

Comment on lines 6 to 8
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
Contributor


high

The wait_for_server function attempts to check server readiness by sending a GET request to /v1/chat/completions. However, this endpoint in api_server_1e1pd.py only supports POST requests. A GET request from curl will result in a 405 Method Not Allowed error, causing the script to wait for the full timeout period before failing. You should use an endpoint that supports GET requests, such as /health, for the health check. This same issue exists in launch_epd_serve.sh and launch_epd_serve_separated.sh.

Suggested change
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
until curl -s localhost:$port/health > /dev/null; do


After the server starts, the client will receive a 405 response, and the script will run properly

Comment on lines +6 to +8
timeout 12000 bash -c "
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
Contributor


high

The wait_for_server function attempts to check server readiness by sending a GET request to /v1/chat/completions. However, this endpoint in the vLLM OpenAI-compatible server only supports POST requests. A GET request from curl will result in a 405 Method Not Allowed error, causing the script to wait for the full timeout period before failing. You should use an endpoint that supports GET requests, such as /health, for the health check.

Suggested change
timeout 12000 bash -c "
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
until curl -s localhost:$port/health > /dev/null; do


same

Comment on lines 6 to 8
timeout 12000 bash -c "
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
Contributor


high

The wait_for_server function attempts to check server readiness by sending a GET request to /v1/chat/completions. However, this endpoint in the vLLM OpenAI-compatible server only supports POST requests. A GET request from curl will result in a 405 Method Not Allowed error, causing the script to wait for the full timeout period before failing. You should use an endpoint that supports GET requests, such as /health, for the health check.

Suggested change
timeout 12000 bash -c "
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
until curl -s localhost:$port/health > /dev/null; do


same

Comment on lines 37 to 40
def _get_request_ranks(self, request_id: str):
# request_id format: $ACTUAL_REQUEST_ID|$E_RANK|$PD_RANK
result = request_id.split("|")
return int(result[1]), int(result[2])
Contributor


high

The _get_request_ranks method parses the request ID by splitting it with "|". This approach is fragile because if the original request_id contains a | character, the parsing will fail and lead to incorrect behavior or errors. For a core component like this, a more robust serialization method, such as JSON, should be used to encapsulate the request ID and rank information to avoid such parsing issues. For example, the payload could be a JSON string like {"req_id": "...", "e_rank": 0, "pd_rank": 1}.


The approach is similar to the xPyD disagg code, where the ZMQ addresses are passed via the request ID. Since the request ID is generated on the provided proxy, the code will function properly.

Comment on lines +86 to +163
for _ in range(self.mm_inputs_total[request.request_id]):
# Clean ignored_preallocs later, currently we assume that
# all mm_inputs will come to the instance at some moment
if (request.request_id, _) in self.received_metas_reqs:
self.received_metas_reqs.remove((request.request_id, _))
continue
self.ignored_preallocs.add((request.request_id, _))
Contributor


high

The comment on line 87, "Clean ignored_preallocs later...", indicates a known issue where ignored_preallocs are not properly cleaned up if a request is finished before all its multimodal inputs are processed. This can lead to a memory leak, as the ignored_preallocs set will grow indefinitely in a long-running server, consuming more and more memory. This should be addressed to ensure the stability of the service.


Currently, the Encoder Scheduler schedules all of a request's multimodal inputs together. As long as the request is scheduled on the E instance, we will clean the ignored_preallocs on the PD instance properly.

The only concern arises when the request is aborted on the E instance before being scheduled but is completed on the PD instance through the cache; we need to handle this.


Fixed: if a request is aborted before being scheduled, we will not send the request to the PD instance from our proxy.

Comment on lines 1220 to 1235
if self.mutex.locked():
return
with self.mutex:
while not self.ec_preallocator.is_empty():
prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = True)
if not prealloc: # can't preallocate
return
if candidate is not None:
self.encoder_cache_manager.preallocate(*candidate)
# last element

prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = False)
if (candidate is not None):
self.encoder_cache_manager.preallocate(*candidate)
Contributor


high

The _perform_preallocations method starts with if self.mutex.locked(): return. This non-blocking check can cause preallocations to be skipped if the lock is contended, potentially leading to performance degradation or starvation of preallocation tasks. Since the scheduler's main loop is single-threaded, this lock is likely intended to protect against concurrent access from background threads (e.g., connector callbacks). A standard blocking with self.mutex: at the beginning of the method would be safer and prevent re-entrancy without the risk of starvation.

Suggested change
if self.mutex.locked():
return
with self.mutex:
while not self.ec_preallocator.is_empty():
prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = True)
if not prealloc: # can't preallocate
return
if candidate is not None:
self.encoder_cache_manager.preallocate(*candidate)
# last element
prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = False)
if (candidate is not None):
self.encoder_cache_manager.preallocate(*candidate)
with self.mutex:
while not self.ec_preallocator.is_empty():
prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = True)
if not prealloc: # can't preallocate
return
if candidate is not None:
self.encoder_cache_manager.preallocate(*candidate)
# last element
prealloc, candidate = self.ec_preallocator.get_prealloc_candidate(
self.encoder_cache_manager.num_free_slots, fill_next = False)
if (candidate is not None):
self.encoder_cache_manager.preallocate(*candidate)


@LastZhabka LastZhabka Aug 22, 2025


Currently, we use only a synchronized encoder cache preallocator, so the lock here has no effect.

In the future, in the asynchronous scenario, there is no need to run the function if it is already being executed in some thread, since the function obtains preallocations from the "global" queue. Therefore, we can safely skip it without waiting for the lock if the lock is already held.

Maybe it will be useful to add a limit on the number of preallocations per perform_preallocations call, because currently, this function executes until there are either no new preallocations or no available slots in the encoder cache manager.

@mergify

mergify bot commented Aug 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fake0fan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@LastZhabka

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural change to support Encode-Prefill-Decode (EPD) disaggregation for multimodal models. The changes are extensive and well-structured, introducing encoder separation, inter-instance communication for cache transfer, and modifications to the scheduler and model runner. The implementation correctly uses wrappers to minimize core logic changes, which is great for maintainability. I've found one critical issue related to a potential deadlock in the scheduling logic that needs to be addressed. Otherwise, this is a very impressive and well-documented contribution.

Comment on lines +258 to +302
def get_prealloc_candidate(
self, free_space: int, fill_next: bool
) -> tuple[bool, tuple[str, int, int, str] | None]:
"""Validate the preallocation candidate, fill the next preallocation
candidate
Validate current preallocation candidate, retrieves the next
preallocation request from the queue. Skips ignored preallocations
and checks whether prellocated data will fit in space constraints.
Args:
free_space: Available cache space in encoder tokens.
fill_next: Whether to fetch the next candidate after processing.
Returns:
Tuple of (should_continue, candidate_data) where:
- should_continue: True if caller should continue preallocations,
False if caller should stop.
- candidate_data: None or tuple of (request_id, input_id,
num_encoder_tokens, mm_hash)
"""
with self.scheduling_lock:
if self.prealloc_candidate is None:
if fill_next is True:
self.prealloc_candidate = self.preallocs_queue.get()
return (True, None) # No candidate, just get next candidate

(request_id, input_id, num_encoder_tokens, mm_hash) = \
self.prealloc_candidate
if num_encoder_tokens > free_space:
return (False, None)

if fill_next is True:
self.prealloc_candidate = self.preallocs_queue.get()
else:
self.prealloc_candidate = None

if (request_id, input_id) in self.ignored_preallocs:
self.ignored_preallocs.remove((request_id, input_id))
return (True, None) # Skip and get next

self.pending_preallocs[request_id].remove(input_id)

return (True, (request_id, input_id, num_encoder_tokens, mm_hash))
Contributor


critical

The use of self.preallocs_queue.get() in this function is blocking. If the queue becomes empty while the scheduler's _perform_preallocations loop is running, this call will block indefinitely, causing the scheduler to hang. The while not self.ec_preallocator.is_empty() check in the scheduler is not sufficient to prevent this due to the pre-fetching logic within this function.

To prevent this potential deadlock, you should use a non-blocking get_nowait() and handle the queue.Empty exception.

    def get_prealloc_candidate(
        self, free_space: int, fill_next: bool
    ) -> tuple[bool, tuple[str, int, int, str] | None]:
        """Validate the preallocation candidate, fill the next preallocation
        candidate

        Validate current preallocation candidate, retrieves the next 
        preallocation request from the queue. Skips ignored preallocations
        and checks whether prellocated data will fit in space constraints.

        Args:
            free_space: Available cache space in encoder tokens.
            fill_next: Whether to fetch the next candidate after processing.

        Returns:
            Tuple of (should_continue, candidate_data) where:
            - should_continue: True if caller should continue preallocations, 
                               False if caller should stop.
            - candidate_data: None or tuple of (request_id, input_id, 
              num_encoder_tokens, mm_hash)
        """
        with self.scheduling_lock:
            if self.prealloc_candidate is None:
                if fill_next is True:
                    try:
                        self.prealloc_candidate = self.preallocs_queue.get_nowait()
                    except queue.Empty:
                        return (True, None) # No candidate, queue is empty
                else:
                    return (True, None) # No candidate, not filling
                
            (request_id, input_id, num_encoder_tokens, mm_hash) = \
                self.prealloc_candidate
            if num_encoder_tokens > free_space:
                return (False, None)

            if fill_next is True:
                try:
                    self.prealloc_candidate = self.preallocs_queue.get_nowait()
                except queue.Empty:
                    self.prealloc_candidate = None
            else:
                self.prealloc_candidate = None

            if (request_id, input_id) in self.ignored_preallocs:
                self.ignored_preallocs.remove((request_id, input_id))
                return (True, None) # Skip and get next
            
            self.pending_preallocs[request_id].remove(input_id)

            return (True, (request_id, input_id, num_encoder_tokens, mm_hash))


The while not self.ec_preallocator.is_empty() check in the scheduler is not sufficient to prevent this due to the pre-fetching logic within this function.

fill_next is true if and only if the queue is not empty; if the queue is empty we pass fill_next = False, which prevents the queue.get() call from executing.

# Cached but currently not referenced by any request
if not self.cached[mm_hash]:
# If mm_hash is in preallocated then it will not be in freeable
if not self.cached[mm_hash] and (not mm_hash in self.preallocated):
Member


Suggested change
if not self.cached[mm_hash] and (not mm_hash in self.preallocated):
if not self.cached[mm_hash] and mm_hash not in self.preallocated:

# mm_hash of mm_data => ids of requests that reference the mm_data
self.cached: dict[str, set[str]] = {}

# Blockign
Member


Suggested change
# Blockign
# Blocking

return
self.cached[mm_hash].discard(req_id)
if not self.cached[mm_hash]:
if (not self.cached[mm_hash]) and (mm_hash not in self.preallocated):
Member


Same

# for these models.
self.encoder_cache_manager = EncoderCacheManager(
cache_size=encoder_cache_size)
cache_size=encoder_cache_size*10)
Member


This seems a bit arbitrary


@LastZhabka LastZhabka Aug 29, 2025


It's a temporary value to indicate that we need to add a method for determining an appropriate encoder cache size; I think we can replace this change with a comment.

Signed-off-by: LastZhabka <sakhmoldin.mukhammadarif@gmail.com>
Signed-off-by: LastZhabka <sakhmoldin.mukhammadarif@gmail.com>
[Misc] Change GPUModelRunnerWrapper names

[Misc] Fix semaphore related risks

[Core] Rollback to common encoder cache lifecycle on E instance scheduler

Signed-off-by: LastZhabka <sakhmoldin.mukhammadarif@gmail.com>
…he manager

[Docs] Docstrings for all new methods provided by EPD disaggregation update

[Docs] Update EPD disaggregation README.md

[Feat] EPD disaggregation multiple instance

[Feat] RedisECConnector, Redis Server configuration through env vars

Signed-off-by: LastZhabka <sakhmoldin.mukhammadarif@gmail.com>
@LastZhabka

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an impressive and significant contribution that introduces a disaggregated architecture for MLLM inference by separating the visual encoder. The design is well-documented with clear diagrams and explanations, which is crucial for such a complex change. The implementation appears robust, with careful consideration for concurrency and state management across distributed instances. I've found one critical issue related to potential deadlocking in the threading logic of the ECConnectorTemplate. Addressing this will be important for the stability of the new system. Overall, this is a very well-executed piece of work that lays a strong foundation for more flexible and efficient MLLM serving.

Comment on lines +304 to +325
def _limiting_wrapper(self, callback: Callable, arg: Callable):
"""Wrapper function to limit the number of workers """
with self.limiting_semaphore:
callback(arg)

def _recv_event_loop(self, ):
"""Run receive event loop
This method runs event loop for receive tasks and ensures that
the number of requested parallel receives is limited by
$max_connector_workers.
"""
try:
if self.target_recv_callback[0] is None:
return
while True:
callback, arg = self.target_recv_callback
with self.limiting_semaphore:
self.recv_executors.submit(self._limiting_wrapper,
callback, arg)
except Exception as e:
raise ConnectionError("Error during recv event loop") from e
Contributor


critical

The current implementation of _recv_event_loop and _limiting_wrapper can lead to a deadlock. The _recv_event_loop acquires the semaphore and then submits a task (_limiting_wrapper) which also attempts to acquire the same semaphore. Since threading.Semaphore is not re-entrant, if the semaphore's value reaches zero, both the main loop thread and the worker threads will block indefinitely, waiting for each other to release the semaphore.

To fix this, I suggest refactoring the logic to a more standard producer-consumer pattern with a semaphore. The main loop should acquire the semaphore to block when all workers are busy, and each worker task should release the semaphore upon completion. This ensures that the number of concurrent receive operations is correctly limited without causing a deadlock.

    def _worker_task(self, callback: Callable, arg: Callable):
        """Wrapper function to release semaphore after task completion."""
        try:
            callback(arg)
        finally:
            self.limiting_semaphore.release()

    def _recv_event_loop(self, ):
        """Run receive event loop 
        
        This method runs event loop for receive tasks and ensures that 
        the number of requested parallel receives is limited by
        $max_connector_workers.
        """
        try:
            if self.target_recv_callback[0] is None:
                return
            while True:
                callback, arg = self.target_recv_callback
                self.limiting_semaphore.acquire()
                self.recv_executors.submit(self._worker_task,
                                           callback, arg)
        except Exception as e:
            raise ConnectionError("Error during recv event loop") from e

@NickLucche
Collaborator

Hey, thanks for the mega job here @fake0fan !

I have a few questions/suggestions since I believe this PR might be going a little bit in an orthogonal direction wrt work on other connectors.

First of all, if Redis is to be introduced in the picture as a core component, why shouldn't we change the EncoderCache to be an actual distributed cache, and use that to store/read tensors?
I feel the current pre-allocation method is too cumbersome when there's already a level of indirection between E and PD with Redis, while it would be required if p2p transfer was happening. With a distributed cache, we would just write and read when a mm request is sent.
Side note, we're still paying the device-host-host-device trip + non-IB transfer cost, but I think the cache is small enough st we don't have to worry about it.

Some other design points in no particular order:

  • I agree the separate runner for the encoder is needed
    • downside, other platforms may need to reimplement it (eg tpu)
  • I also agree with a separate scheduler for the encoder
    • downside, both changes will require additional burden to sync and maintain when core changes which is not ideal.
  • Pretty sure DisaggPrefillDecodeGPURunnerWrapper should be Mixin
  • I surely must've missed it, but where are we setting E to load only the encoder model and PD to load only the language_model?

@LastZhabka

Hi @NickLucche,

I surely must've missed it, but where are we setting E to load only the encoder model and PD to load only the language_model?

During implementation, I wasn't sure if all models were initialized in the same way, so both the PD and E instances currently load both the encoder and the language model.

I feel the current pre-allocation method is too cumbersome when there's already a level of indirection between E and PD with Redis, while it would be required if p2p transfer was happening. With a distributed cache, we would just write and read when a mm request is sent.

I designed the system to align more with the current PD disaggregation implementation in vLLM, to support P2P and direct transfer using the same tech stack (ZMQ + NCCL).

For the distributed cache, we can reuse the same communication model (with some parts skipped) and use the server with the cache instead of the E instance (?)

@fake0fan
Contributor Author

fake0fan commented Sep 4, 2025

Hi, @NickLucche, Thank you for your valuable feedback.

First of all, if Redis is to be introduced in the picture as a core component, why shouldn't we change the EncoderCache to be an actual distributed cache, and use that to store/read tensors? I feel the current pre-allocation method is too cumbersome when there's already a level of indirection between E and PD with Redis, while it would be required if p2p transfer was happening. With a distributed cache, we would just write and read when a mm request is sent. Side note, we're still paying the device-host-host-device trip + non-IB transfer cost, but I think the cache is small enough st we don't have to worry about it.

We have also noticed the issues you mentioned, so we plan to reference the current kv connector design overall and refactor a similar ec_connector design to reduce the corresponding complexity in scheduling logic and support for different storage and transmission methods.

Some other design points in no particular order:

  • I agree the separate runner for the encoder is needed

    • downside, other platforms may need to reimplement it (eg tpu)

We will also consider support for different platforms. Our current idea is to first support a distributed storage on host memory, which can be relatively independent from support for different computing hardware (GPU, TPU, NPU). Of course, we will also decide whether to implement some hardware-specific optimizations (such as P2P direct transfer, etc.) based on performance results later on.

  • I also agree with a separate scheduler for the encoder

    • downside, both changes will require additional burden to sync and maintain when core changes which is not ideal.

Agreed. So we might still want to keep the scheduler integrated rather than separating it out, similar to kv, mainly adding related logic to the scheduler to support EPD.

  • I surely must've missed it, but where are we setting E to load only the encoder model and PD to load only the language_model?

The current implementation indeed hasn't implemented this optimization. We will implement Encoder loading only the encoder model later on, but for PD I'm still considering whether, if preemption occurs, the PD instance might still need the Encoder model to perform recomputation, or whether it could directly read the corresponding Encoder embedding data through a distributed storage. This may require further discussion.

Overall, this is roughly our current situation. I'm wondering if we need to open a Google doc to introduce more discussion?

cc @ywang96

@NickLucche
Collaborator

Hey @fake0fan , apologies for the delay, I was off the last few days.
Thanks a lot for your work and for clarifying the design.

I'm wondering if we need to open a Google doc to introduce more discussion

I also believe we could benefit from the "RFC path", opening up a doc where we can discuss different design options (example https://docs.google.com/document/d/10jhCNxJYvsUhtMtiMAaW2MxU5LU8HVje2pGDnj49gH4/edit?tab=t.0#heading=h.wsk0hlrf3cp2).
We should be able to circle back on this quickly given the amount of specifications you already provided, but at least we could list some of the points that emerged from this review at the top.

@fake0fan
Contributor Author

Hey @fake0fan , apologies for the delay, I was off the last few days. Thanks a lot for your work and for clarifying the design.

I'm wondering if we need to open a Google doc to introduce more discussion

I also believe we could benefit from the "RFC path", opening up a doc where we can discuss different design options (example https://docs.google.com/document/d/10jhCNxJYvsUhtMtiMAaW2MxU5LU8HVje2pGDnj49gH4/edit?tab=t.0#heading=h.wsk0hlrf3cp2). We should be able to circle back on this quickly given the amount of specifications you already provided, but at least we could list some of the points that emerged from this review at the top.

right, that sounds great. I've opened a Google Doc and briefly described some of our current design ideas. We can also discuss more design choices for EPD in it later.

vLLM Encode Prefill Decode Disaggregation Design

@NickLucche
Collaborator

Thanks for doing this! Let me share it with more interested people too. I am confident we can iterate quickly.
