[Core] Encoder separation for Encode-Prefill-Decode Disaggregation #25233
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| # Disaggregated Encoder | ||
|
|
||
| A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits: | ||
|
|
||
| 1. **Independent, fine-grained scaling** | ||
| 2. **Lower time-to-first-token (TTFT)** | ||
| 3. **Cross-process reuse and caching of encoder outputs** | ||
|
|
||
| Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE> | ||
|
|
||
| --- | ||
|
|
||
| ## 1 Motivation | ||
|
|
||
| ### 1. Independent, fine-grained scaling | ||
|
|
||
| * Vision encoders are lightweight, while language models are orders of magnitude larger. | ||
| * The language model can be parallelised without affecting the encoder fleet. | ||
| * Encoder nodes can be added or removed independently. | ||
|
|
||
| ### 2. Lower time-to-first-token (TTFT) | ||
|
|
||
| * Language-only requests bypass the vision encoder entirely. | ||
| * Encoder output is injected only at required attention layers, shortening the pre-fill critical path. | ||
|
|
||
| ### 3. Cross-process reuse and caching | ||
|
|
||
| * In-process encoders confine reuse to a single worker. | ||
| * A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation. | ||
|
|
||
| --- | ||
|
|
||
| ## 2 Usage Example | ||
|
|
||
| The current reference pathway is **SharedStorageConnector**. | ||
| The ready-to-run scripts below show the workflow: | ||
|
|
||
| 1 Encoder instance + 1 PD instance: | ||
| `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh` | ||
|
|
||
| 1 Encoder instance + 1 Prefill instance + 1 Decode instance: | ||
| `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh` | ||
|
|
||
| --- | ||
|
|
||
| ## 3 Test Script | ||
|
|
||
| Please refer to the directory `tests/v1/ec_connector`. | ||
|
|
||
| ## 4 Development | ||
|
|
||
| Disaggregated encoding is implemented by running two kinds of instances: | ||
|
|
||
| * **Encoder instance** – a vLLM instance that performs vision encoding. | ||
| * **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode. | ||
| * PD can run either as a single instance with `disagg_encoder_example.sh` (E->PD) or as disaggregated instances with `disagg_epd_example.sh` (E->P->D). | ||
|
|
||
| A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance. | ||
| All related code is under `vllm/distributed/ec_transfer`. | ||
|
|
||
| ### Key abstractions | ||
|
|
||
| * **ECConnector** – interface for retrieving EC caches produced by the encoder. | ||
| * *Scheduler role* – checks cache existence and schedules loads. | ||
| * *Worker role* – loads the embeddings into memory. | ||
|
|
||
| Here is a figure illustrating the disaggregated-encoder flow: | ||
|
|
||
|  | ||
|
|
||
| For the PD-disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated-encoder flow above. The Prefill instance executes one step (prefill -> first output token) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens entirely after the Prefill instance finishes its step. | ||
|
|
||
| `docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0). | ||
|
|
||
| The example setup uses the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and follows `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D. |
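The encoder/PD pairing described above is driven by the `--ec-transfer-config` flag. The sketch below, mirroring the example scripts, builds the producer and consumer configs as shell variables; the shared-storage path is a placeholder, and only the `ec_role` differs between the two sides:

```shell
# Shared path that both instances must be able to see (placeholder).
EC_SHARED_STORAGE_PATH="/tmp/ec_cache"

# Producer config for the encoder instance.
EC_PRODUCER_CONFIG='{
  "ec_connector": "ECSharedStorageConnector",
  "ec_role": "ec_producer",
  "ec_connector_extra_config": {"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"}
}'

# Consumer config for the PD instance; only the role changes.
EC_CONSUMER_CONFIG="${EC_PRODUCER_CONFIG/ec_producer/ec_consumer}"

# The instances would then be launched roughly as:
#   vllm serve "$MODEL" --ec-transfer-config "$EC_PRODUCER_CONFIG" ...   # encoder
#   vllm serve "$MODEL" --ec-transfer-config "$EC_CONSUMER_CONFIG" ...   # PD
echo "$EC_CONSUMER_CONFIG" | python3 -c 'import json,sys; print(json.load(sys.stdin)["ec_role"])'
```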
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| # Disaggregated Encoder | ||
|
|
||
| This example contains scripts that demonstrate the encode-prefill-decode (EPD) disaggregation features of vLLM. | ||
|
|
||
| Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for a detailed explanation of the EPD features. | ||
|
|
||
| ## Files | ||
|
|
||
| - `disagg_epd_proxy.py` - Proxy that demonstrates XeYpZd (X encode instances, Y prefill instances, Z decode instances); currently stable for 1e1p1d. | ||
| - `disagg_1e1p1d_example.sh` - Sets up 1e1p1d and runs the VisionArena benchmark. | ||
| - `disagg_1e1pd_example.sh` - Sets up 1e1pd and runs the VisionArena benchmark. | ||
|
|
||
| Detailed explanations are commented in the scripts. |
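A typical invocation of the 1e1p1d example might look like the sketch below. It assumes a checkout that contains the script, a machine with at least three GPUs, and vLLM installed; the GPU ids and prompt count are placeholders:

```shell
# Hypothetical run of the 1e1p1d example; all values below are placeholders.
export MODEL="Qwen/Qwen2.5-VL-3B-Instruct"
export GPU_E=0 GPU_P=1 GPU_D=2   # one GPU per stage (encode, prefill, decode)
export NUM_PROMPTS=20            # shorter benchmark run
bash disagg_1e1p1d_example.sh || echo "requires a GPU machine with vLLM installed"
```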
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,196 @@ | ||
| #!/bin/bash | ||
| set -euo pipefail | ||
|
|
||
| declare -a PIDS=() | ||
|
|
||
| ############################################################################### | ||
| # Configuration -- override via env before running | ||
| ############################################################################### | ||
| MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}" | ||
| LOG_PATH="${LOG_PATH:-./logs}" | ||
| mkdir -p "$LOG_PATH" | ||
|
|
||
| ENCODE_PORT="${ENCODE_PORT:-19534}" | ||
| PREFILL_PORT="${PREFILL_PORT:-19535}" | ||
| DECODE_PORT="${DECODE_PORT:-19536}" | ||
| PROXY_PORT="${PROXY_PORT:-10001}" | ||
|
|
||
| GPU_E="${GPU_E:-2}" | ||
| GPU_P="${GPU_P:-2}" | ||
| GPU_D="${GPU_D:-3}" | ||
|
|
||
| EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache}" | ||
| TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-12000}" # wait_for_server timeout | ||
|
|
||
| NUM_PROMPTS="${NUM_PROMPTS:-100}" # number of prompts to send in benchmark | ||
|
|
||
| export UCX_TLS=all | ||
| export UCX_NET_DEVICES=all | ||
|
|
||
| ############################################################################### | ||
| # Helpers | ||
| ############################################################################### | ||
| START_TIME=$(date +"%Y%m%d_%H%M%S") | ||
| ENC_LOG=$LOG_PATH/encoder_${START_TIME}.log | ||
| P_LOG=$LOG_PATH/p_${START_TIME}.log | ||
| D_LOG=$LOG_PATH/d_${START_TIME}.log | ||
| PROXY_LOG=$LOG_PATH/proxy_${START_TIME}.log | ||
|
|
||
| wait_for_server() { | ||
| local port=$1 | ||
| timeout "$TIMEOUT_SECONDS" bash -c " | ||
| until curl -s localhost:$port/v1/chat/completions > /dev/null; do | ||
| sleep 1 | ||
| done" && return 0 || return 1 | ||
| } | ||
|
|
||
| # Cleanup function | ||
| cleanup() { | ||
| echo "Stopping everything…" | ||
| trap - INT TERM USR1 # prevent re-entrancy | ||
|
|
||
| # Kill all tracked PIDs | ||
| for pid in "${PIDS[@]}"; do | ||
| if kill -0 "$pid" 2>/dev/null; then | ||
| echo "Killing process $pid" | ||
| kill "$pid" 2>/dev/null | ||
| fi | ||
| done | ||
|
|
||
| # Wait a moment for graceful shutdown | ||
| sleep 2 | ||
|
|
||
| # Force kill any remaining processes | ||
| for pid in "${PIDS[@]}"; do | ||
| if kill -0 "$pid" 2>/dev/null; then | ||
| echo "Force killing process $pid" | ||
| kill -9 "$pid" 2>/dev/null | ||
| fi | ||
| done | ||
|
|
||
| # Kill the entire process group as backup | ||
| kill -- -$$ 2>/dev/null | ||
|
|
||
| echo "All processes stopped." | ||
| exit 0 | ||
| } | ||
|
Comment on lines +48 to +76
Contributor: The
Contributor: added PIDS=() at the top of the script, referring to other example scripts |
||
|
|
||
| trap cleanup INT | ||
| trap cleanup USR1 | ||
| trap cleanup TERM | ||
|
|
||
| # clear previous cache | ||
| echo "remove previous ec cache folder" | ||
| rm -rf "$EC_SHARED_STORAGE_PATH" | ||
|
|
||
| echo "make ec cache folder" | ||
| mkdir -p "$EC_SHARED_STORAGE_PATH" | ||
|
|
||
| ############################################################################### | ||
| # Encoder worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.01 \ | ||
| --port "$ENCODE_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --no-enable-prefix-caching \ | ||
| --max-num-batched-tokens 4096 \ | ||
| --max-num-seqs 128 \ | ||
| --ec-transfer-config '{ | ||
| "ec_connector": "ECSharedStorageConnector", | ||
| "ec_role": "ec_producer", | ||
| "ec_connector_extra_config": { | ||
| "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'" | ||
| } | ||
| }' \ | ||
| >"${ENC_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| ############################################################################### | ||
| # Prefill worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_P" \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \ | ||
| vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.7 \ | ||
| --port "$PREFILL_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --max-num-seqs 128 \ | ||
| --ec-transfer-config '{ | ||
| "ec_connector": "ECSharedStorageConnector", | ||
| "ec_role": "ec_consumer", | ||
| "ec_connector_extra_config": { | ||
| "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'" | ||
| } | ||
| }' \ | ||
| --kv-transfer-config '{ | ||
| "kv_connector": "NixlConnector", | ||
| "kv_role": "kv_producer" | ||
| }' \ | ||
| >"${P_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| ############################################################################### | ||
| # Decode worker | ||
| ############################################################################### | ||
| CUDA_VISIBLE_DEVICES="$GPU_D" \ | ||
| UCX_NET_DEVICES=all \ | ||
| VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \ | ||
| vllm serve "$MODEL" \ | ||
| --gpu-memory-utilization 0.7 \ | ||
| --port "$DECODE_PORT" \ | ||
| --enforce-eager \ | ||
| --enable-request-id-headers \ | ||
| --max-num-seqs 128 \ | ||
| --kv-transfer-config '{ | ||
| "kv_connector": "NixlConnector", | ||
| "kv_role": "kv_consumer" | ||
| }' \ | ||
| >"${D_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| # Wait for workers | ||
| wait_for_server $ENCODE_PORT | ||
| wait_for_server $PREFILL_PORT | ||
| wait_for_server $DECODE_PORT | ||
|
|
||
| ############################################################################### | ||
| # Proxy | ||
| ############################################################################### | ||
| python disagg_epd_proxy.py \ | ||
| --host "0.0.0.0" \ | ||
| --port "$PROXY_PORT" \ | ||
| --encode-servers-urls "http://localhost:$ENCODE_PORT" \ | ||
| --prefill-servers-urls "http://localhost:$PREFILL_PORT" \ | ||
| --decode-servers-urls "http://localhost:$DECODE_PORT" \ | ||
| >"${PROXY_LOG}" 2>&1 & | ||
|
|
||
| PIDS+=($!) | ||
|
|
||
| wait_for_server $PROXY_PORT | ||
| echo "All services are up!" | ||
|
|
||
| ############################################################################### | ||
| # Benchmark | ||
| vllm bench serve \ | ||
| --model $MODEL \ | ||
| --backend openai-chat \ | ||
| --endpoint /v1/chat/completions \ | ||
| --dataset-name hf \ | ||
| --dataset-path lmarena-ai/VisionArena-Chat \ | ||
| --seed 0 \ | ||
| --num-prompts $NUM_PROMPTS \ | ||
| --port $PROXY_PORT | ||
|
|
||
| ############################################################################### | ||
|
|
||
| # cleanup | ||
| echo "cleanup..." | ||
| cleanup | ||
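Once the script reports "All services are up!", clients talk only to the proxy. A minimal request sketch is shown below; it assumes the default `PROXY_PORT` from the script, and the model name and image URL are placeholders:

```shell
# Sketch of a multimodal request routed through the EPD proxy.
PROXY_PORT="${PROXY_PORT:-10001}"
payload='{
  "model": "Qwen/Qwen2.5-VL-3B-Instruct",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]
  }]
}'
# The proxy forwards the request to the encoder first, then to prefill/decode.
curl -s "http://localhost:${PROXY_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "proxy not reachable"
```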