Commits (59, all by Abatom)
- `968b95e` proxy (Jun 25, 2025)
- `b4d14e7` format (Jun 25, 2025)
- `f71822b` _listen_for_requests (Jun 25, 2025)
- `38f19f5` received (Jun 25, 2025)
- `64a2113` recv_tensor (Jun 25, 2025)
- `9b719e0` sent (Jun 25, 2025)
- `3ed3829` zmq_address (Jun 25, 2025)
- `a443708` send_queue_cv (Jun 25, 2025)
- `f9ab67f` bugfix & format (Jun 25, 2025)
- `0eaf48c` debug (Jun 26, 2025)
- `f6b1c68` bugfix (Jun 26, 2025)
- `f8b0cfc` bugfix (Jun 26, 2025)
- `50bbc9b` add log (Jun 26, 2025)
- `5bf5681` recv_request_id_to_tensor_ids (Jun 26, 2025)
- `c46ec51` recv_request_id_to_tensor_ids (Jun 26, 2025)
- `776a058` bugfix (Jun 26, 2025)
- `d70614a` event.synchronize() (Jun 26, 2025)
- `fae468d` rm with torch.cuda.stream(stream) (Jun 26, 2025)
- `e28fa40` log level (Jun 26, 2025)
- `fc16221` to(self.device) (Jun 26, 2025)
- `edca394` current_stream() (Jun 26, 2025)
- `40f4742` log level (Jun 26, 2025)
- `178ff2c` log level (Jun 26, 2025)
- `4999a42` add nvtx (Jun 27, 2025)
- `9af341a` with torch.cuda.stream(stream) (Jun 27, 2025)
- `41b8eba` mod proxy port (Jun 28, 2025)
- `07d01cd` del PUT_ASYNC (Jun 28, 2025)
- `82e5fac` _recv (Jun 28, 2025)
- `1e31f15` format (Jun 28, 2025)
- `f292872` remove nvtx (Jun 28, 2025)
- `f724f58` update md (Jun 28, 2025)
- `e5e585b` add log (Jun 30, 2025)
- `9ff003f` bugfix (Jun 30, 2025)
- `be38ac8` bugfix (Jun 30, 2025)
- `3b51339` add log (Jun 30, 2025)
- `479aa0d` finished_recving (Jun 30, 2025)
- `0b35d02` format (Jun 30, 2025)
- `349cb61` update md (Jun 30, 2025)
- `db2a1e5` finished_recving (Jun 30, 2025)
- `61c243a` discard (Jun 30, 2025)
- `3a47595` bugfix (Jun 30, 2025)
- `f4bba34` finished_recving_kv_req_ids (Jun 30, 2025)
- `af0d84b` bugfix (Jun 30, 2025)
- `7fa1f11` rm num_layers (Jun 30, 2025)
- `c200e58` format (Jun 30, 2025)
- `100026d` self.num_layers (Jul 1, 2025)
- `f80dad5` rm _ (Jul 1, 2025)
- `e6d225d` rm _ (Jul 1, 2025)
- `ec5c434` Merge main (Jul 1, 2025)
- `4fb5f85` bugfix (Jul 1, 2025)
- `2c8f5c2` format (Jul 1, 2025)
- `c284257` format (Jul 1, 2025)
- `0a69f9a` torch.empty (Jul 2, 2025)
- `f5a06ea` bugfix (Jul 2, 2025)
- `53241bb` sched_yield (Jul 2, 2025)
- `6d411a8` format (Jul 2, 2025)
- `d0e432d` time.sleep(0) (Jul 2, 2025)
- `81e7a80` add comments (Jul 2, 2025)
- `2535da8` add comments (Jul 2, 2025)
34 changes: 17 additions & 17 deletions docs/design/v1/p2p_nccl_connector.md
@@ -8,7 +8,7 @@ As shown in Figure 1, the overall process of this **PD disaggregation** solution
1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
2. The Proxy/Router selects a **1P1D (1 Prefill instance + 1 Decode instance)** pair through either round-robin or random selection, generates a `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to **1**, and then forwards the request to the **P instance**.
3. Immediately afterward, the Proxy/Router forwards the **original HTTP request** to the **D instance**.
-4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT_ASYNC** mode). The D instance's `zmq_addr` can be resolved through the `request_id`.
+4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT** mode). The D instance's `zmq_addr` can be resolved through the `request_id`.
5. The **D instance** has a **dedicated thread** for receiving the KV cache (to avoid blocking the main process). The received KV cache is saved into the **GPU memory buffer**, the size of which is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the **local Tensor memory pool**.
6. During the **Decode**, the D instance's main process retrieves the KV cache (transmitted by the P instance) from either the **GPU buffer** or the **memory pool**, thereby **skipping Prefill**.
7. After completing **Decode**, the D instance returns the result to the **Proxy/Router**, which then forwards it to the **client**.
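
The exact `request_id` rules are introduced later; as a rough sketch (the format below is an assumption inferred from the proxy script in this PR, not a spec), the Proxy/Router can embed both instances' ZMQ addresses in the id, so the P instance can later resolve the D instance's `zmq_addr` simply by parsing it:

```python
# Hypothetical helper: encode both ZMQ addresses into the request_id.
import uuid

def make_request_id(prefill_zmq_addr: str, decode_zmq_addr: str) -> str:
    return (
        f"___prefill_addr_{prefill_zmq_addr}"
        f"___decode_addr_{decode_zmq_addr}_{uuid.uuid4().hex}"
    )

# e.g. ___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_<uuid>
print(make_request_id("10.0.1.2:21001", "10.0.1.3:22001"))
```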
@@ -31,9 +31,9 @@ Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (cur
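
A minimal sketch of such a heartbeat sender follows (the socket type and interval are assumptions; the message fields match what `_listen_for_register` in the proxy script below expects):

```python
import time

import msgpack
import zmq

def run_heartbeat(proxy_addr: str, role: str, http_addr: str,
                  zmq_addr: str, interval: float = 3.0) -> None:
    # DEALER pairs with the proxy's ROUTER socket (e.g. "10.0.1.1:30201").
    sock = zmq.Context().socket(zmq.DEALER)
    sock.connect(f"tcp://{proxy_addr}")
    while True:
        sock.send(msgpack.dumps({
            "type": role,               # "P" for prefill, "D" for decode
            "http_address": http_addr,  # e.g. "10.0.1.2:20005"
            "zmq_address": zmq_addr,    # e.g. "10.0.1.2:21001"
        }))
        time.sleep(interval)            # interval is an assumption
```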

## KV Cache Transfer Methods

-There are three methods for KVcache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVcache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVcache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVcache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVcache from the P instance once it has allocated space for the KVcache.
+There are two methods for KV cache transfer: PUT and GET. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. PUT involves the P instance actively sending the KV cache to the D instance. PUT is an asynchronous transfer method: it uses a dedicated thread for sending the KV cache, so it does not block the main process. In contrast, the GET method involves the P instance saving the KV cache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KV cache from the P instance once it has allocated space for it.

-Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT.
+Experimental results have shown that PUT outperforms GET.
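
For illustration, here is the same `--kv-transfer-config` payload used in the launch commands below, built as a Python dict with `send_type` set explicitly (treating PUT as the default when the field is omitted is an assumption):

```python
import json

kv_transfer_config = {
    "kv_connector": "P2pNcclConnector",
    "kv_role": "kv_producer",
    "kv_buffer_size": "1e1",
    "kv_port": "21001",
    "kv_connector_extra_config": {
        "proxy_ip": "10.0.1.1",
        "proxy_port": "30201",
        "http_port": "20005",
        "send_type": "GET",  # or "PUT"; PUT_ASYNC no longer exists
        "nccl_num_channels": "16",
    },
}
# Pass on the command line as: --kv-transfer-config '<json>'
print(json.dumps(kv_transfer_config))
```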

## P2P Communication via ZMQ & NCCL

@@ -53,7 +53,7 @@ Each NCCL group occupies a certain amount of GPU memory buffer for communication

## GPU Memory Buffer and Tensor Memory Pool

-The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVcache sent by P instances. If it is too large, it will reduce the KVcache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size.
+The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT mode, but it is necessary in GET mode. For D instances, a memory buffer is needed in both modes. The memory buffer for D instances should not be too large; neither should the buffer of P instances in GET mode. The memory buffer of D instances is used to temporarily store KV cache sent by P instances. If it is too large, it reduces the KV cache space available for normal inference on D instances, thereby decreasing the inference batch size and ultimately lowering output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the GPU memory size.
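
A back-of-the-envelope sizing check (the 80 GB GPU is an assumed example; the 10% figure comes from the guidance above):

```python
gpu_mem_bytes = 80 * 1024**3                # assumed 80 GB GPU
kv_buffer_size = int(0.10 * gpu_mem_bytes)  # 10% rule of thumb
print(f"{kv_buffer_size:.1e}")              # ~8.6e9, close to the "8e9" in the configs below
```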

If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVcache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVcache loss. Once KVcache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance.

@@ -68,7 +68,7 @@ To address the above issues, I have designed and developed a local Tensor memory
cd /home

# Download the installation package; the commit-id will be updated in time. You can copy the command directly.
-wget https://vllm-wheels.s3.us-west-2.amazonaws.com/9112b443a042d8d815880b8780633882ad32b183/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
+wget https://vllm-wheels.s3.us-west-2.amazonaws.com/0d06b533a0fcca7a62603c868df68235659d6935/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

# Download the code repository.
git clone -b xpyd-v1 https://github.com/Abatom/vllm.git
@@ -88,9 +88,9 @@ To address the above issues, I have designed and developed a local Tensor memory
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.
- You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict).
-- `PUT_ASYNC` offers the best performance and should be prioritized.
+- `PUT` offers the best performance and should be prioritized.
- The `--port` must be consistent with the `http_port` in the `--kv-transfer-config`.
-- The `disagg_prefill_proxy_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances).
+- The `disagg_prefill_proxy_xpyd.py` script will use port 10101 (for receiving client requests) and port 30201 (for receiving service discovery from P and D instances).
- The node running the proxy must have `quart` installed.
- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
@@ -123,7 +123,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20005","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
@@ -145,7 +145,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20009","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
@@ -167,7 +167,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20003","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
@@ -189,7 +189,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20008","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

## Run 3P1D
@@ -220,7 +220,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20005","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
@@ -242,7 +242,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20009","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
@@ -264,7 +264,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20003","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
@@ -286,13 +286,13 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \
--disable-log-request \
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30201","http_port":"20008","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
```

## Single request

```shell
-curl -X POST -s http://10.0.1.1:10001/v1/completions \
+curl -X POST -s http://10.0.1.1:10101/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "base_model",
@@ -313,7 +313,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--dataset-name "random" \
--host 10.0.1.1 \
---port 10001 \
+--port 10101 \
--random-input-len 1024 \
--random-output-len 1024 \
--ignore-eos \
disagg_prefill_proxy_xpyd.py
@@ -3,20 +3,35 @@
import os
import socket
import threading
+import time
import uuid
+from typing import Any

import aiohttp
import msgpack
import zmq
from quart import Quart, make_response, request

count = 0
-prefill_instances: dict[str, str] = {}  # http_address: zmq_address
-decode_instances: dict[str, str] = {}  # http_address: zmq_address
+prefill_instances: dict[str, Any] = {}  # http_address: (zmq_address, stamp)
+decode_instances: dict[str, Any] = {}  # http_address: (zmq_address, stamp)

prefill_cv = threading.Condition()
decode_cv = threading.Condition()

+# How long a registration stays valid without a fresh heartbeat.
+DEFAULT_PING_SECONDS = 5
+
+
+def _remove_oldest_instances(instances: dict[str, Any]) -> None:
+    # Entries are ordered by last heartbeat (pop + reinsert on each ping),
+    # so expired entries sit at the front; evict until a valid stamp is found.
+    oldest_key = next(iter(instances), None)
+    while oldest_key is not None:
+        value = instances[oldest_key]
+        if value[1] > time.time():
+            break
+        print(f"🔴Remove [HTTP:{oldest_key}, ZMQ:{value[0]}, stamp:{value[1]}]")
+        instances.pop(oldest_key, None)
+        oldest_key = next(iter(instances), None)
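
An illustration (editorial, not part of the PR) of the eviction behavior, relying on dict insertion order matching registration order:

```python
import time

instances = {
    "10.0.1.2:20005": ("10.0.1.2:21001", time.time() - 1),  # stamp expired
    "10.0.1.3:20009": ("10.0.1.3:22001", time.time() + 5),  # still valid
}
_remove_oldest_instances(instances)
assert list(instances) == ["10.0.1.3:20009"]
```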


def _listen_for_register(poller, router_socket):
    while True:
@@ -30,19 +30,33 @@ def _listen_for_register(poller, router_socket):
                global prefill_instances
                global prefill_cv
                with prefill_cv:
-                    prefill_instances[data["http_address"]] = data["zmq_address"]
+                    # node is the previous registration (None on first sight);
+                    # store (zmq_address, expiry stamp) so stale entries age out.
+                    node = prefill_instances.pop(data["http_address"], None)
+                    prefill_instances[data["http_address"]] = (
+                        data["zmq_address"],
+                        time.time() + DEFAULT_PING_SECONDS,
+                    )
+                    _remove_oldest_instances(prefill_instances)

            elif data["type"] == "D":
                global decode_instances
                global decode_cv
                with decode_cv:
-                    decode_instances[data["http_address"]] = data["zmq_address"]
+                    node = decode_instances.pop(data["http_address"], None)
+                    decode_instances[data["http_address"]] = (
+                        data["zmq_address"],
+                        time.time() + DEFAULT_PING_SECONDS,
+                    )
+                    _remove_oldest_instances(decode_instances)
            else:
                print(f"Unexpected message from {remote_address}, data: {data}")

+            # Only P/D messages define node; log fresh registrations.
+            if data["type"] in ("P", "D") and node is None:
+                print(f"🔵Add [HTTP:{data['http_address']}, "
+                      f"ZMQ:{data['zmq_address']}]")


def start_service_discovery(hostname, port):
    if not hostname:
@@ -104,12 +133,14 @@ async def handle_request():
    with prefill_cv:
        prefill_list = list(prefill_instances.items())
        prefill_addr, prefill_zmq_addr = prefill_list[count % len(prefill_list)]
+        # Values are now (zmq_address, stamp) tuples; keep only the address.
+        prefill_zmq_addr = prefill_zmq_addr[0]

    global decode_instances
    global decode_cv
    with decode_cv:
        decode_list = list(decode_instances.items())
        decode_addr, decode_zmq_addr = decode_list[count % len(decode_list)]
+        decode_zmq_addr = decode_zmq_addr[0]

    print(
        f"handle_request count: {count}, [HTTP:{prefill_addr}, "
@@ -149,6 +180,6 @@ async def handle_request():


if __name__ == "__main__":
-    t = start_service_discovery("0.0.0.0", 30001)
-    app.run(host="0.0.0.0", port=10001)
+    t = start_service_discovery("0.0.0.0", 30201)
+    app.run(host="0.0.0.0", port=10101)
    t.join()
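
The body of `handle_request` is largely elided in this diff. As context, a rough sketch of the forwarding flow it implements per the design doc above (illustrative only; the header name carrying the `request_id` is an assumption):

```python
import aiohttp

async def forward(prefill_addr: str, decode_addr: str,
                  request_data: dict, request_id: str) -> bytes:
    headers = {"X-Request-Id": request_id}  # hypothetical header name
    # Step 2 of the flow: send the request with max_tokens=1 to the P instance.
    prefill_request = dict(request_data, max_tokens=1)
    async with aiohttp.ClientSession() as session:
        async with session.post(f"http://{prefill_addr}/v1/completions",
                                json=prefill_request, headers=headers) as resp:
            await resp.read()  # wait for prefill to complete
        # Step 3: forward the original request to the D instance.
        async with session.post(f"http://{decode_addr}/v1/completions",
                                json=request_data, headers=headers) as resp:
            return await resp.read()
```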