[Core] Encoder separation for Encode-Prefill-Decode Disaggregation #25233
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| # Disaggregated Encoder | ||
|
|
||
| A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits: | ||
|
|
||
| 1. **Independent, fine-grained scaling** | ||
| 2. **Lower time-to-first-token (TTFT)** | ||
| 3. **Cross-process reuse and caching of encoder outputs** | ||
|
|
||
| Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE> | ||
|
|
||
| --- | ||
|
|
||
| ## 1 Motivation | ||
|
|
||
| ### 1. Independent, fine-grained scaling | ||
|
|
||
| * Vision encoders are lightweight, while language models are orders of magnitude larger. | ||
| * The language model can be parallelised without affecting the encoder fleet. | ||
| * Encoder nodes can be added or removed independently. | ||
|
|
||
| ### 2. Lower time-to-first-token (TTFT) | ||
|
|
||
| * Language-only requests bypass the vision encoder entirely. | ||
| * Encoder output is injected only at required attention layers, shortening the pre-fill critical path. | ||
|
|
||
| ### 3. Cross-process reuse and caching | ||
|
|
||
| * In-process encoders confine reuse to a single worker. | ||
| * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation. | ||
|
|
||
| --- | ||
|
|
||
| ## 2 Usage Example | ||
|
|
||
| The current reference pathway is **SharedStorageConnector**. | ||
| The ready-to-run scripts below show the workflow: | ||
|
|
||
| 1 Encoder instance + 1 PD instance: | ||
| `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh` | ||
|
|
||
| 1 Encoder instance + 1 Prefill instance + 1 Decode instance: | ||
| `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh` | ||
|
|
||
| --- | ||
|
|
||
| ## 3 Test Script | ||
|
|
||
| Please refer to the directory `tests/v1/ec_connector`. | ||
|
|
||
| ## 4 Development | ||
|
|
||
| Disaggregated encoding is implemented by running two kinds of instances: | ||
|
|
||
| * **Encoder instance** – a vLLM instance that performs vision encoding. | ||
| * **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode. | ||
| * PD can run either as a single instance with `disagg_encoder_example.sh` (E->PD) or as disaggregated instances with `disagg_epd_example.sh` (E->P->D). | ||
|
|
||
| A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance. | ||
| All related code is under `vllm/distributed/ec_transfer`. | ||
|
|
||
| ### Key abstractions | ||
|
|
||
| * **ECConnector** – interface for retrieving EC caches produced by the encoder. | ||
| * *Scheduler role* – checks cache existence and schedules loads. | ||
| * *Worker role* – loads the embeddings into memory. | ||
|
|
||
| Here is a figure illustrating the disaggregated-encoder flow: | ||
|
|
||
|  | ||
|
|
||
| For the PD-disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated-encoder flow above. The Prefill instance executes one step (prefill -> first output token) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens entirely after the Prefill instance finishes its step. | ||
|
|
||
| `docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0). | ||
|
|
||
| The example setup uses the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and follows `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D. |
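The encoder/PD pairing described above is driven by the `--ec-transfer-config` flag. The sketch below, mirroring the example scripts, builds the producer and consumer configs as shell variables; the shared-storage path is a placeholder, and only the `ec_role` differs between the two sides:

```shell
# Shared path that both instances must be able to see (placeholder).
EC_SHARED_STORAGE_PATH="/tmp/ec_cache"

# Producer config for the encoder instance.
EC_PRODUCER_CONFIG='{
  "ec_connector": "ECSharedStorageConnector",
  "ec_role": "ec_producer",
  "ec_connector_extra_config": {"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"}
}'

# Consumer config for the PD instance; only the role changes.
EC_CONSUMER_CONFIG="${EC_PRODUCER_CONFIG/ec_producer/ec_consumer}"

# The instances would then be launched roughly as:
#   vllm serve "$MODEL" --ec-transfer-config "$EC_PRODUCER_CONFIG" ...   # encoder
#   vllm serve "$MODEL" --ec-transfer-config "$EC_CONSUMER_CONFIG" ...   # PD
echo "$EC_CONSUMER_CONFIG" | python3 -c 'import json,sys; print(json.load(sys.stdin)["ec_role"])'
```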
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| # Disaggregated Encoder | ||
|
|
||
| This example contains scripts that demonstrate the encode-prefill-decode (EPD) disaggregation features of vLLM. | ||
|
|
||
| Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for a detailed explanation of the EPD features. | ||
|
|
||
| ## Files | ||
|
|
||
| - `disagg_epd_proxy.py` - Proxy that demonstrates XeYpZd (X encode instances, Y prefill instances, Z decode instances); currently stable for 1e1p1d. | ||
| - `disagg_1e1p1d_example.sh` - Sets up 1e1p1d and runs the VisionArena benchmark. | ||
| - `disagg_1e1pd_example.sh` - Sets up 1e1pd and runs the VisionArena benchmark. | ||
|
|
||
| Detailed explanations are commented in the scripts. |
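A typical invocation of the 1e1p1d example might look like the sketch below. It assumes a checkout that contains the script, a machine with at least three GPUs, and vLLM installed; the GPU ids and prompt count are placeholders:

```shell
# Hypothetical run of the 1e1p1d example; all values below are placeholders.
export MODEL="Qwen/Qwen2.5-VL-3B-Instruct"
export GPU_E=0 GPU_P=1 GPU_D=2   # one GPU per stage (encode, prefill, decode)
export NUM_PROMPTS=20            # shorter benchmark run
bash disagg_1e1p1d_example.sh || echo "requires a GPU machine with vLLM installed"
```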
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,196 @@ | ||
| #!/bin/bash | ||
| set -euo pipefail | ||
|
|
||
| declare -a PIDS=() | ||
|
|
||
| ############################################################################### | ||
| # Configuration -- override via env before running | ||
| ############################################################################### | ||
| MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}" | ||
| LOG_PATH="${LOG_PATH:-./logs}" | ||
| mkdir -p "$LOG_PATH" | ||
|
|
||
| ENCODE_PORT="${ENCODE_PORT:-19534}" | ||
| PREFILL_PORT="${PREFILL_PORT:-19535}" | ||
| DECODE_PORT="${DECODE_PORT:-19536}" | ||
| PROXY_PORT="${PROXY_PORT:-10001}" | ||
|
|
||
| GPU_E="${GPU_E:-2}" | ||
| GPU_P="${GPU_P:-2}" | ||
| GPU_D="${GPU_D:-3}" | ||
|
|
||
| EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache}" | ||
| TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-12000}" # wait_for_server timeout | ||
|
|
||
| NUM_PROMPTS="${NUM_PROMPTS:-100}" # number of prompts to send in benchmark | ||
|
|
||
| export UCX_TLS=all | ||
| export UCX_NET_DEVICES=all | ||
|
|
||
| ############################################################################### | ||
| # Helpers | ||
| ############################################################################### | ||
| START_TIME=$(date +"%Y%m%d_%H%M%S") | ||
| ENC_LOG=$LOG_PATH/encoder_${START_TIME}.log | ||
| P_LOG=$LOG_PATH/p_${START_TIME}.log | ||
| D_LOG=$LOG_PATH/d_${START_TIME}.log | ||
| PROXY_LOG=$LOG_PATH/proxy_${START_TIME}.log | ||
|
|
||
| wait_for_server() { | ||
| local port=$1 | ||
| timeout "$TIMEOUT_SECONDS" bash -c " | ||
| until curl -s localhost:$port/v1/chat/completions > /dev/null; do | ||
| sleep 1 | ||
| done" && return 0 || return 1 | ||
| } | ||
|
|
||
| # Cleanup function | ||
| cleanup() { | ||
| echo "Stopping everything…" | ||
| trap - INT TERM USR1 # prevent re-entrancy | ||
|
|
||
| # Kill all tracked PIDs | ||
| for pid in "${PIDS[@]}"; do | ||
| if kill -0 "$pid" 2>/dev/null; then | ||
| echo "Killing process $pid" | ||
| kill "$pid" 2>/dev/null | ||
| fi | ||
| done | ||
|
|
||
| # Wait a moment for graceful shutdown | ||
| sleep 2 | ||
|
|
||
| # Force kill any remaining processes | ||
| for pid in "${PIDS[@]}"; do | ||
| if kill -0 "$pid" 2>/dev/null; then | ||
| echo "Force killing process $pid" | ||
| kill -9 "$pid" 2>/dev/null | ||
| fi | ||
| done | ||
|
|
||
| # Kill the entire process group as backup | ||
| kill -- -$$ 2>/dev/null | ||
|
|
||
| echo "All processes stopped." | ||
| exit 0 | ||
| } | ||
|
Comment on lines +48 to +76
Contributor: The
Contributor: added PIDS=() at the top of the script, referring to other example scripts |
||
|
|
||
| trap cleanup INT | ||
| trap cleanup USR1 | ||
| trap cleanup TERM | ||
|
|
||
| # clear previous cache | ||
| echo "remove previous ec cache folder" | ||
| rm -rf "$EC_SHARED_STORAGE_PATH" | ||
|
|
||
| echo "make ec cache folder" | ||
| mkdir -p "$EC_SHARED_STORAGE_PATH" | ||
|
|
||
| ############################################################################### | ||
| # Encoder worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.01 \ | ||
| --port "$ENCODE_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --no-enable-prefix-caching \ | ||
| --max-num-batched-tokens 4096 \ | ||
| --max-num-seqs 128 \ | ||
| --ec-transfer-config '{ | ||
| "ec_connector": "ECSharedStorageConnector", | ||
| "ec_role": "ec_producer", | ||
| "ec_connector_extra_config": { | ||
| "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'" | ||
| } | ||
| }' \ | ||
| >"${ENC_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| ############################################################################### | ||
| # Prefill worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_P" \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \ | ||
| vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.7 \ | ||
| --port "$PREFILL_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --max-num-seqs 128 \ | ||
| --ec-transfer-config '{ | ||
| "ec_connector": "ECSharedStorageConnector", | ||
| "ec_role": "ec_consumer", | ||
| "ec_connector_extra_config": { | ||
| "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'" | ||
| } | ||
| }' \ | ||
| --kv-transfer-config '{ | ||
| "kv_connector": "NixlConnector", | ||
| "kv_role": "kv_producer" | ||
| }' \ | ||
| >"${P_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| ############################################################################### | ||
| # Decode worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_D" \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \ | ||
| vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.7 \ | ||
| --port "$DECODE_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --max-num-seqs 128 \ | ||
| --kv-transfer-config '{ | ||
| "kv_connector": "NixlConnector", | ||
| "kv_role": "kv_consumer" | ||
| }' \ | ||
| >"${D_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| # Wait for workers | ||
| wait_for_server $ENCODE_PORT | ||
| wait_for_server $PREFILL_PORT | ||
| wait_for_server $DECODE_PORT | ||
|
|
||
| ############################################################################### | ||
| # Proxy | ||
| ############################################################################### | ||
| python disagg_epd_proxy.py \ | ||
| --host "0.0.0.0" \ | ||
| --port "$PROXY_PORT" \ | ||
| --encode-servers-urls "http://localhost:$ENCODE_PORT" \ | ||
| --prefill-servers-urls "http://localhost:$PREFILL_PORT" \ | ||
| --decode-servers-urls "http://localhost:$DECODE_PORT" \ | ||
| >"${PROXY_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| wait_for_server $PROXY_PORT | ||
| echo "All services are up!" | ||
|
|
||
| ############################################################################### | ||
| # Benchmark | ||
| vllm bench serve \ | ||
| --model $MODEL \ | ||
| --backend openai-chat \ | ||
| --endpoint /v1/chat/completions \ | ||
| --dataset-name hf \ | ||
| --dataset-path lmarena-ai/VisionArena-Chat \ | ||
| --seed 0 \ | ||
| --num-prompts $NUM_PROMPTS \ | ||
| --port $PROXY_PORT | ||
|
|
||
| ############################################################################### | ||
|
|
||
| # cleanup | ||
| echo "cleanup..." | ||
| cleanup | ||
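Once the script reports "All services are up!", clients talk only to the proxy. A minimal request sketch is shown below; it assumes the default `PROXY_PORT` from the script, and the model name and image URL are placeholders:

```shell
# Sketch of a multimodal request routed through the EPD proxy.
PROXY_PORT="${PROXY_PORT:-10001}"
payload='{
  "model": "Qwen/Qwen2.5-VL-3B-Instruct",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]
  }]
}'
# The proxy forwards the request to the encoder first, then to prefill/decode.
curl -s "http://localhost:${PROXY_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "proxy not reachable"
```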