Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
71 commits
Select commit Hold shift + click to select a range
49d76a1
[Draft] EPD
Sep 8, 2025
00d1185
[Fix] Runable for non EPD mode
Sep 9, 2025
3031ea0
[Fix] Remove hard code path in share storage connector
Sep 9, 2025
c2ea7db
[Misc] Clean code and docs
Sep 9, 2025
7448888
[Misc] Clean code
Sep 9, 2025
bc00658
[Fix] Enable start Encoder Instance without KVC and fix PD instance s…
Sep 10, 2025
946bfb4
[Minor] Fix EC update state after encoder cache allocation
Sep 10, 2025
e1ecb30
[Bug] Check actualy tensor file for EC Cache exist in ECSharedStorage…
Sep 10, 2025
d20c0f8
[Bugfix] Encoder Instance does not return Output to client
Sep 10, 2025
a290dda
[Bugfix] Fix try scheduler encoder inputs return when there is no enc…
knlnguyen1802 Sep 11, 2025
668125d
[Fix] Fix typo and move all hardcode to launch config
knlnguyen1802 Sep 12, 2025
62d4a2a
[Fix] Fix typo in docs
knlnguyen1802 Sep 12, 2025
a0735ce
[Misc] Add example for disaggregate encoder
knlnguyen1802 Sep 15, 2025
bc1f9ba
[Fix] Clean code
knlnguyen1802 Sep 15, 2025
bcf5286
[Bugfix] Fix bug when request there is no encoder input
knlnguyen1802 Sep 15, 2025
250d9e3
[Misc] Optimize disaggregate encoder proxy
knlnguyen1802 Sep 16, 2025
f5ba236
[Minor] Remove some comments and update docs
knlnguyen1802 Sep 16, 2025
c5514d6
[Minor] Remove unused logging
knlnguyen1802 Sep 16, 2025
f470d17
[Misc] Add docs for disaggregate encoder
knlnguyen1802 Sep 16, 2025
756663c
[Feature] E+P+D disagg with proxy & example
herotai214 Sep 16, 2025
3613ecc
[Misc] Proxy support E+P+D & E+PD / Update docs & examples
herotai214 Sep 17, 2025
23c4902
Update vllm/config/__init__.py
knlnguyen1802 Sep 22, 2025
7ef47ad
[Misc] Resolve comments
herotai214 Oct 8, 2025
1b79908
Update examples/online_serving/disaggregated_encoder/README.md
knlnguyen1802 Oct 8, 2025
e4c9174
[Misc] resolve comment and fix example script
khuonglmhw Oct 8, 2025
7ba00e3
[Misc] Resolve comment
knlnguyen1802 Oct 9, 2025
99b34e3
[Misc] Resolve comment
knlnguyen1802 Oct 9, 2025
dc9e2a9
[CI/Build] Unit test for ECSharedStorageConnector
herotai214 Oct 10, 2025
2169531
[Fix] remove v1 check for create_connector and update docs
knlnguyen1802 Oct 10, 2025
a2221ea
[Misc] Clean code to pass pre-commit rules
khuonglmhw Oct 14, 2025
a08d1d5
[Bugfix] Fix ec_connector_output for non multimodal model
knlnguyen1802 Oct 17, 2025
ef73fd8
[Fix] Rebase
knlnguyen1802 Oct 21, 2025
3f86218
[Fix] Rebase
knlnguyen1802 Oct 21, 2025
ff050c7
Fix pre-commit
knlnguyen1802 Oct 21, 2025
1e4c7ad
Merge branch 'main' into epd_draft
knlnguyen1802 Oct 21, 2025
7968f61
[Misc] Fix/add documentation
khuonglmhw Oct 21, 2025
ec4e701
Fix docs
knlnguyen1802 Oct 22, 2025
3cc166f
Fix test and clean code
knlnguyen1802 Oct 22, 2025
67b502e
Fix docs
knlnguyen1802 Oct 22, 2025
84d99c7
Fix docs
knlnguyen1802 Oct 22, 2025
a5fc6fe
Fix docs again
knlnguyen1802 Oct 22, 2025
7ecb6f5
Merge branch 'main' into epd_draft
khuonglm Oct 22, 2025
30888ba
[Misc] Script to verify EPD correctness
herotai214 Oct 22, 2025
90d40fe
Merge branch 'epd_draft' of https://github.com/fake0fan/vllm into epd…
knlnguyen1802 Oct 23, 2025
eabfe7e
Fix pre-commit
knlnguyen1802 Oct 23, 2025
5120f76
Fix pre-commit
knlnguyen1802 Oct 23, 2025
7e7a269
Fix pre-commit
knlnguyen1802 Oct 23, 2025
37545ef
[Fix] fix pre-commit
khuonglm Oct 23, 2025
f1ee67c
Fix docs
knlnguyen1802 Oct 23, 2025
a8f5dad
Fix whisper bug
knlnguyen1802 Oct 24, 2025
45d7724
Rebase with main
knlnguyen1802 Oct 24, 2025
42f43fb
Fix pre-commit
knlnguyen1802 Oct 24, 2025
c3050fc
Merge branch 'main' into epd_draft
knlnguyen1802 Oct 24, 2025
12880b9
Rebase with main
knlnguyen1802 Oct 27, 2025
563cdd4
[CI/Build] Fix test_scheduler disable_hybrid_kv_cache_manager flag
herotai214 Oct 27, 2025
92a5222
Merge branch 'main' into epd_draft
knlnguyen1802 Oct 27, 2025
bda8f49
Merge branch 'main' into epd_draft
fake0fan Oct 27, 2025
1c76328
[CI/Build][Bugfix] Fix and add test ensure init ECConnector before KV…
herotai214 Oct 28, 2025
d772f81
Rebase with main
knlnguyen1802 Oct 30, 2025
cc31385
Merge branch 'main' into epd_draft
knlnguyen1802 Oct 30, 2025
604f94d
Merge branch 'main' into epd_draft
knlnguyen1802 Oct 30, 2025
3e75764
Merge branch 'main' into epd_draft & Add ec_connector_output to Exect…
herotai214 Nov 4, 2025
5e79b0b
Fix scheduler pre-commit
herotai214 Nov 4, 2025
2665ff6
[Bugfix] Fix num_tokens_to_schedule & Add unit test
herotai214 Nov 5, 2025
8e04ca9
Fix pre-commit
herotai214 Nov 6, 2025
2353348
[Bugfix] Add logs to EPD proxy at during runtime
herotai214 Nov 6, 2025
69515d8
Edit Readme & comments
herotai214 Nov 6, 2025
cabee08
Merge branch 'main' into epd_draft; Revert HMA related changes
herotai214 Nov 6, 2025
9f71ddf
fix pre-commit
herotai214 Nov 6, 2025
1659fc4
[Misc] Support local mm, Increase encoder budget in test & better err…
herotai214 Nov 7, 2025
66efd95
Merge branch 'main' into epd_draft; no conflict
herotai214 Nov 7, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
75 changes: 75 additions & 0 deletions docs/features/disagg_encoder.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
# Disaggregated Encoder

A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:

1. **Independent, fine-grained scaling**
2. **Lower time-to-first-token (TTFT)**
3. **Cross-process reuse and caching of encoder outputs**

Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>

---

## 1 Motivation

### 1. Independent, fine-grained scaling

* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.

### 2. Lower time-to-first-token (TTFT)

* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at required attention layers, shortening the pre-fill critical path.

### 3. Cross-process reuse and caching

* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.

---

## 2 Usage Example

The current reference pathway is **SharedStorageConnector**.
Below ready-to-run scripts shows the workflow:

1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`

1 Encoder instance + 1 Prefill instance + 1 Decode instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh`

---

## 3 Test Script

Please refer to the directories `tests/v1/ec_connector`

## 4 Development

Disaggregated encoding is implemented by running two parts:

* **Encoder instance** – a vLLM instance to performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
* PD can be in either a single normal instance with `disagg_encoder_example.sh` (E->PD) or in disaggregated instances with `disagg_epd_example.sh` (E->P->D)

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.

### Key abstractions

* **ECConnector** – interface for retrieving EC caches produced by the encoder.
* *Scheduler role* – checks cache existence and schedules loads.
* *Worker role* – loads the embeddings into memory.

Here is a figure illustrating disaggregate encoder flow:

![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)

For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.

`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)

We create the example setup with the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` and referred to the `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the kv transfer between P and D;
13 changes: 13 additions & 0 deletions examples/online_serving/disaggregated_encoder/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Disaggregated Encoder

This example contains scripts that demonstrate the disaggregated encoder (EPD) features of vLLM.

Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for the detailed explanation for the EPD features.

## Files

- `disagg_epd_proxy.py` - Proxy to demonstrates XeYpZd (X encode instances, Y prefill instances, Z decode instances); Currently stable for 1e1p1d.
- `disagg_1e1p1d_example.sh` - Setup 1e1p1d and run VisionArena benchmark.
- `disagg_1e1pd_example.sh` - Setup 1e1pd and run VisionArena benchmark.

Detailed explanations are commnented in the scripts.
196 changes: 196 additions & 0 deletions examples/online_serving/disaggregated_encoder/disagg_1e1p1d_example.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,196 @@
#!/bin/bash
set -euo pipefail

declare -a PIDS=()

###############################################################################
# Configuration -- override via env before running
###############################################################################
MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}"
LOG_PATH="${LOG_PATH:-./logs}"
mkdir -p $LOG_PATH

ENCODE_PORT="${ENCODE_PORT:-19534}"
PREFILL_PORT="${PREFILL_PORT:-19535}"
DECODE_PORT="${DECODE_PORT:-19536}"
PROXY_PORT="${PROXY_PORT:-10001}"

GPU_E="${GPU_E:-2}"
GPU_P="${GPU_P:-2}"
GPU_D="${GPU_D:-3}"

EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-12000}" # wait_for_server timeout

NUM_PROMPTS="${NUM_PROMPTS:-100}" # number of prompts to send in benchmark

export UCX_TLS=all
export UCX_NET_DEVICES=all

###############################################################################
# Helpers
###############################################################################
START_TIME=$(date +"%Y%m%d_%H%M%S")
ENC_LOG=$LOG_PATH/encoder_${START_TIME}.log
P_LOG=$LOG_PATH/p_${START_TIME}.log
D_LOG=$LOG_PATH/d_${START_TIME}.log
PROXY_LOG=$LOG_PATH/proxy_${START_TIME}.log

wait_for_server() {
local port=$1
timeout "$TIMEOUT_SECONDS" bash -c "
until curl -s localhost:$port/v1/chat/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}

# Cleanup function
cleanup() {
echo "Stopping everything…"
trap - INT TERM USR1 # prevent re-entrancy

# Kill all tracked PIDs
for pid in "${PIDS[@]}"; do
if kill -0 "$pid" 2>/dev/null; then
echo "Killing process $pid"
kill "$pid" 2>/dev/null
fi
done

# Wait a moment for graceful shutdown
sleep 2

# Force kill any remaining processes
for pid in "${PIDS[@]}"; do
if kill -0 "$pid" 2>/dev/null; then
echo "Force killing process $pid"
kill -9 "$pid" 2>/dev/null
fi
done

# Kill the entire process group as backup
kill -- -$$ 2>/dev/null

echo "All processes stopped."
exit 0
}
Comment on lines +48 to +76
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The PIDS array is used in the cleanup function and to track background processes, but it's not declared. With set -u active, this will cause an "unbound variable" error when the script tries to access it, making the script fail. Please declare it before its first use, for example: declare -a PIDS=().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added PIDS=() at top of the script reffering to other example scripts


trap cleanup INT
trap cleanup USR1
trap cleanup TERM

# clear previous cache
echo "remove previous ec cache folder"
rm -rf $EC_SHARED_STORAGE_PATH

echo "make ec cache folder"
mkdir -p $EC_SHARED_STORAGE_PATH

###############################################################################
# Encoder worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
--gpu-memory-utilization 0.01 \
--port "$ENCODE_PORT" \
--enforce-eager \
--enable-request-id-headers \
--no-enable-prefix-caching \
--max-num-batched-tokens 4096 \
--max-num-seqs 128 \
--ec-transfer-config '{
"ec_connector": "ECSharedStorageConnector",
"ec_role": "ec_producer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
>"${ENC_LOG}" 2>&1 &

PIDS+=($!)

###############################################################################
# Prefill worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_P" \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
vllm serve "$MODEL" \
--gpu-memory-utilization 0.7 \
--port "$PREFILL_PORT" \
--enforce-eager \
--enable-request-id-headers \
--max-num-seqs 128 \
--ec-transfer-config '{
"ec_connector": "ECSharedStorageConnector",
"ec_role": "ec_consumer",
"ec_connector_extra_config": {
"shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
}
}' \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_producer"
}' \
>"${P_LOG}" 2>&1 &

PIDS+=($!)

###############################################################################
# Decode worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_D" \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
vllm serve "$MODEL" \
--gpu-memory-utilization 0.7 \
--port "$DECODE_PORT" \
--enforce-eager \
--enable-request-id-headers \
--max-num-seqs 128 \
--kv-transfer-config '{
"kv_connector": "NixlConnector",
"kv_role": "kv_consumer"
}' \
>"${D_LOG}" 2>&1 &

PIDS+=($!)

# Wait for workers
wait_for_server $ENCODE_PORT
wait_for_server $PREFILL_PORT
wait_for_server $DECODE_PORT

###############################################################################
# Proxy
###############################################################################
python disagg_epd_proxy.py \
--host "0.0.0.0" \
--port "$PROXY_PORT" \
--encode-servers-urls "http://localhost:$ENCODE_PORT" \
--prefill-servers-urls "http://localhost:$PREFILL_PORT" \
--decode-servers-urls "http://localhost:$DECODE_PORT" \
>"${PROXY_LOG}" 2>&1 &

PIDS+=($!)

wait_for_server $PROXY_PORT
echo "All services are up!"

###############################################################################
# Benchmark
vllm bench serve \
--model $MODEL \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--seed 0 \
--num-prompts $NUM_PROMPTS \
--port $PROXY_PORT

PIDS+=($!)
###############################################################################

# cleanup
echo "cleanup..."
cleanup
Loading