[Core] implement disaggregated prefilling via KV cache transfer #6170

Closed
wants to merge 158 commits

Changes from all commits (158 commits)
6fc14d4
add idealized disagg prefill benchmark
KuntaiDu Jul 6, 2024
69d1514
add main
KuntaiDu Jul 6, 2024
2bc8e79
fix typo
KuntaiDu Jul 6, 2024
3ea715d
use mkdir -p to avoid error
KuntaiDu Jul 6, 2024
3656f8a
fix bug
KuntaiDu Jul 6, 2024
f8cb6fc
disable log request from vllm server, and mute curl
KuntaiDu Jul 6, 2024
d4b23c0
add disaggregated prefilling benchmark
KuntaiDu Jul 6, 2024
a942663
do not launch 2 vllm instance
KuntaiDu Jul 7, 2024
540d362
reduce # of prompt to half
KuntaiDu Jul 7, 2024
4b0a7ff
reduce input len by 1
KuntaiDu Jul 7, 2024
2989656
adjust filename
KuntaiDu Jul 7, 2024
69f729c
create 4x sonnet
KuntaiDu Jul 7, 2024
43e1e5e
adjust setup
KuntaiDu Jul 7, 2024
29a7b88
add benchmark
KuntaiDu Jul 7, 2024
4d31316
allow prefix input len == sonnet input len
KuntaiDu Jul 8, 2024
4e336fc
add parameter sweeping
KuntaiDu Jul 8, 2024
2770c61
aadjust firmat
KuntaiDu Jul 8, 2024
80061d2
rename script
KuntaiDu Jul 8, 2024
8c0a9dc
align naming
KuntaiDu Jul 8, 2024
7d84965
adjust qps
KuntaiDu Jul 9, 2024
5ac5249
adjust swap range
KuntaiDu Jul 9, 2024
8f25985
remove results
KuntaiDu Jul 9, 2024
2363fa0
adjust benchmark results so that there are 150 output tokens by defau…
KuntaiDu Jul 9, 2024
3db988c
add example usage for disaggregated prefill
KuntaiDu Jul 17, 2024
00e46de
add environment variable for disaggregated prefill
KuntaiDu Jul 17, 2024
de434d9
add a new distributed group for disaggregated prefill NCCL communication
KuntaiDu Jul 17, 2024
f157f6b
only inflate the world size inside parallel_state.py
KuntaiDu Jul 18, 2024
de82c3c
add more log information
KuntaiDu Jul 18, 2024
69ce0e0
specify vllm port
KuntaiDu Jul 18, 2024
e3dc2e9
avoid switching to unused ports in disaggregated prefilling
KuntaiDu Jul 18, 2024
f164aa7
Merge branch 'main' into kuntai-disagg
KuntaiDu Jul 18, 2024
18fe19c
adjust parallel state to include _DISAGG distributed group
KuntaiDu Jul 18, 2024
94cadb8
offset global rank for decoding instances
KuntaiDu Jul 18, 2024
ded5d92
adjust naming: use prefill and decode instead of prefilling and decoding
KuntaiDu Jul 18, 2024
709ae05
adjust the example: let the decode process in foreground for debugging
KuntaiDu Jul 18, 2024
2ab44d4
adjust logger format
KuntaiDu Jul 18, 2024
2213881
test if the P2P cache stucks when no disaggregated prefilling
KuntaiDu Jul 18, 2024
544f5cb
let decode instance sleep, to avoid generating P2P cache simultaneously
KuntaiDu Jul 18, 2024
04d319a
continue disaggregated prefill debugging
KuntaiDu Jul 18, 2024
2e0f02c
offset world group for decoding instance
KuntaiDu Jul 18, 2024
fd5f115
a syntax fix
KuntaiDu Jul 18, 2024
8d90e6a
bug fix
KuntaiDu Jul 18, 2024
a9474a7
specify the source of get_open_port
KuntaiDu Jul 18, 2024
701b087
document why specifying the source of get_open_port
KuntaiDu Jul 18, 2024
fa5d71f
add VLLM_TRACE_FUNCTION to track the call stack
KuntaiDu Jul 19, 2024
e2faede
fix customadapter bug
KuntaiDu Jul 19, 2024
76b6c5e
add parallel state logs for debugging
KuntaiDu Jul 19, 2024
cb6d6a5
add sleep when initializing parallel state
KuntaiDu Jul 19, 2024
fe8fb47
only log when rank%4==0
KuntaiDu Jul 19, 2024
cc89bfb
only log when rank%4==0
KuntaiDu Jul 19, 2024
531bdf3
bug fix
KuntaiDu Jul 19, 2024
1804656
also only log when rank=4 in custom all reduce
KuntaiDu Jul 19, 2024
81c8640
add debuging statement around broadcast
KuntaiDu Jul 19, 2024
5ba142c
debug init_world_group
KuntaiDu Jul 19, 2024
cc939cf
put the log inside a text file
KuntaiDu Jul 19, 2024
8ac9266
init DISAGG first
KuntaiDu Jul 19, 2024
58849fa
init DISAGG before global
KuntaiDu Jul 19, 2024
08797e2
put it behind world_size
KuntaiDu Jul 19, 2024
4ff4cd6
add more debug information in pynccl
KuntaiDu Jul 19, 2024
b09e4e6
typo fix
KuntaiDu Jul 19, 2024
583de97
more debug
KuntaiDu Jul 19, 2024
74bcfff
more debug info
KuntaiDu Jul 19, 2024
2175825
put every output
KuntaiDu Jul 19, 2024
3e07770
remove unnecessary sleep
KuntaiDu Jul 19, 2024
a22e5cd
add sucess statement
KuntaiDu Jul 19, 2024
2c0c27d
add debug statement
KuntaiDu Jul 19, 2024
a783787
log rank in success message
KuntaiDu Jul 19, 2024
79f0b06
sleep based on rank to avoid message overlapping
KuntaiDu Jul 19, 2024
b17f20f
increase torch debug level
KuntaiDu Jul 19, 2024
025f209
sleep
KuntaiDu Jul 19, 2024
32292f1
set gloo debugging level to trace
KuntaiDu Jul 19, 2024
389fb24
reduce debugging commands
KuntaiDu Jul 19, 2024
1b38b29
avoid initializing NCCL first
KuntaiDu Jul 19, 2024
bb8c08a
check
KuntaiDu Jul 19, 2024
25a7cf3
locate the hanging line
KuntaiDu Jul 19, 2024
999bd72
add rank to CPU group
KuntaiDu Jul 19, 2024
3428ea6
narrow case
KuntaiDu Jul 19, 2024
91e3ed2
bug fix: need to align the distributed groups between prefill and dec…
KuntaiDu Jul 20, 2024
3dd2275
add disaggregated prefilling for flashinfer
KuntaiDu Jul 23, 2024
2b13f3c
adjust comments
KuntaiDu Jul 23, 2024
8c3f209
add logging for send and recv
KuntaiDu Jul 23, 2024
c6a5e57
turn off chunked prefill to use flashinfer kernel
KuntaiDu Jul 23, 2024
b3c47f3
confirm which backend is being used
KuntaiDu Jul 23, 2024
f05540c
remove debugging from parallel_state, its too much...
KuntaiDu Jul 23, 2024
eb96fe7
add disagg prefill for flash attn backend
KuntaiDu Jul 23, 2024
09d5588
edit flash attn to assign prefill_meta first
KuntaiDu Jul 23, 2024
43077e7
use print instead of attn
KuntaiDu Jul 23, 2024
f716737
make data contiguous
KuntaiDu Jul 23, 2024
0d07251
add more debug message
KuntaiDu Jul 23, 2024
2177737
turn on logging
KuntaiDu Jul 23, 2024
a293bd0
more debug prints in flash_attn
KuntaiDu Jul 23, 2024
cc7f646
remove enforce eager
KuntaiDu Jul 23, 2024
68f3d16
adjust printing order in flash attn
KuntaiDu Jul 23, 2024
21a61b9
avoid sending & receiving output tensor during profile run
KuntaiDu Jul 23, 2024
691cad7
also log the device
KuntaiDu Jul 23, 2024
c057f19
adjust implementation
KuntaiDu Jul 23, 2024
82b73bb
finish adjustment
KuntaiDu Jul 23, 2024
6db1d48
fall back to original flashinfer
KuntaiDu Jul 23, 2024
9e53071
Merge branch 'vllm-project:main' into kuntai-disagg
KuntaiDu Jul 23, 2024
dbaade7
add space
KuntaiDu Jul 23, 2024
f572db8
clean config.py
KuntaiDu Jul 23, 2024
9ebf3ad
keep flashattn implementation
KuntaiDu Jul 23, 2024
67b1c2e
commit changes that will be merged
KuntaiDu Jul 23, 2024
4acad6a
Merge branch 'kuntai-disagg' of https://github.com/KuntaiDu/vllm into…
KuntaiDu Jul 23, 2024
3abca47
revert custom allreduce changes
KuntaiDu Jul 23, 2024
0ce251b
remove debug logs from the file
KuntaiDu Jul 23, 2024
1f3ac2b
revert changes to prefix_caching_block --- unnecessary
KuntaiDu Jul 23, 2024
c93bf33
revert changes
KuntaiDu Jul 23, 2024
8dcaf43
fix typos
KuntaiDu Jul 23, 2024
4d83813
add example usage to disaggregated prefill
KuntaiDu Jul 23, 2024
11c3ace
can only use print instead of log.debug...
KuntaiDu Jul 23, 2024
0bd0cc9
kill vllm instance after run
KuntaiDu Jul 23, 2024
39973bb
add proxy server for disaggregated prefilling
KuntaiDu Jul 24, 2024
13a6d12
update disagg proxy server
KuntaiDu Jul 24, 2024
81cad25
add debug message for proxy server
KuntaiDu Jul 24, 2024
198931b
fix bug
KuntaiDu Jul 24, 2024
7412767
increase nccl buff size
KuntaiDu Jul 24, 2024
bd6f41b
increase nccl buffer size
KuntaiDu Jul 24, 2024
20f9de1
add debug flag
KuntaiDu Jul 24, 2024
11850d5
reduce gpu memory usage
KuntaiDu Jul 24, 2024
d6ad9bd
fix syntax bug
KuntaiDu Jul 24, 2024
57dd656
temporarily lift up nccl buffer size for send and recv
KuntaiDu Jul 24, 2024
9379fbb
reduce nccl buffer size and see if bug fixed
KuntaiDu Jul 24, 2024
c23d841
fix
KuntaiDu Jul 24, 2024
7fc62b4
add debug info -- see which layer the prefill instance got stuck
KuntaiDu Jul 24, 2024
e542366
remove nccl debug -- it is too loud
KuntaiDu Jul 24, 2024
e9f7dc2
change buffer size only for disagg communicator
KuntaiDu Jul 24, 2024
18ded4c
disable nccl debug
KuntaiDu Jul 24, 2024
e814f82
use isend and irecv
KuntaiDu Jul 24, 2024
a3399b3
try to increase the buffer size
KuntaiDu Jul 24, 2024
5e18bd7
Merge branch 'main' into kuntai-disagg
KuntaiDu Jul 30, 2024
e4e60d9
bug fix, now disaggregated prefill should work as expected
KuntaiDu Jul 31, 2024
87fbfae
add proxy server
KuntaiDu Jul 31, 2024
fa664c0
startr slow -- using pp=1 and tp=1
KuntaiDu Aug 1, 2024
6bf7583
adjust the API
KuntaiDu Aug 1, 2024
6aad5cc
support batch size >1
KuntaiDu Aug 2, 2024
e934286
update model runner
KuntaiDu Aug 2, 2024
b68435a
move group coordinator to a separate file, move disagg implementation…
KuntaiDu Aug 4, 2024
e54f7a3
no need to send during attention
KuntaiDu Aug 4, 2024
23c9949
debug tp
KuntaiDu Aug 4, 2024
87cb78b
resolve conflicts
KuntaiDu Aug 4, 2024
06a526a
Fix several bugs: tensor device placement, misc performance optimizat…
KuntaiDu Aug 5, 2024
34e6bb3
remove useless comments
KuntaiDu Aug 5, 2024
55bf3bf
update disaggregated prefill example
KuntaiDu Aug 5, 2024
b525510
add disaggregated prefill overhead benchmark
KuntaiDu Aug 6, 2024
ee6a6ec
change disagg prefill proxy server to support non-streaming case
KuntaiDu Aug 7, 2024
f3cc91d
avoid detokenizing the first token in prefill instance -- for shorter…
KuntaiDu Aug 7, 2024
0582265
add failure test cases --- try switching to another machine
KuntaiDu Aug 7, 2024
89d4ca4
update
KuntaiDu Aug 7, 2024
9f4dba2
remove debugging information
KuntaiDu Aug 8, 2024
aa55883
avoid broadcast by finding seqlen inside the attn metadata
KuntaiDu Aug 9, 2024
95df023
update examples
KuntaiDu Aug 9, 2024
d92223a
support pipeline parallel
KuntaiDu Aug 9, 2024
a8c202c
update benchmark --- compare chunked prefill w.r.t. disagg prefill
KuntaiDu Aug 10, 2024
310f3a3
mute round_robin_proxy -- too loud
KuntaiDu Aug 10, 2024
118aab1
bug fix: racing conditions, and rare cases where input hash is not ca…
KuntaiDu Aug 10, 2024
96d38b4
add visualization script
KuntaiDu Aug 11, 2024
3fc0c5c
fix bug: when KV transfer fails, do not return hidden state
KuntaiDu Aug 11, 2024
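
Taken together, the commits above implement the following mechanism: the prefill and decode vLLM instances join an extra distributed group dedicated to disaggregated prefill, and once the prefill instance finishes a request it pushes each layer's KV cache to the decode instance over that group using non-blocking NCCL send/recv. The snippet below is only a rough sketch of that transfer step, with assumed function names and tensor layouts; it is not the API introduced by this PR.

import torch
import torch.distributed as dist
from typing import List

def send_kv_caches(kv_caches: List[torch.Tensor], dst_rank: int, group) -> None:
    # Prefill side: make each layer's KV cache contiguous before sending
    # (mirroring the "make data contiguous" commit above), post one
    # non-blocking send per layer, then wait for all sends to complete.
    bufs = [kv.contiguous() for kv in kv_caches]
    handles = [dist.isend(buf, dst=dst_rank, group=group) for buf in bufs]
    for handle in handles:
        handle.wait()

def recv_kv_caches(kv_caches: List[torch.Tensor], src_rank: int, group) -> None:
    # Decode side: receive each layer directly into its pre-allocated
    # KV cache tensor, then wait for all transfers to finish.
    handles = [dist.irecv(kv, src=src_rank, group=group) for kv in kv_caches]
    for handle in handles:
        handle.wait()

The file diffs below cover the benchmark scripts added alongside the core implementation.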
6 changes: 3 additions & 3 deletions benchmarks/benchmark_serving.py
@@ -125,9 +125,9 @@ def sample_sonnet_requests(
     prefix_len: int,
     tokenizer: PreTrainedTokenizerBase,
 ) -> List[Tuple[str, str, int, int]]:
-    assert (
-        input_len > prefix_len
-    ), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'."
+    assert input_len >= prefix_len, (
+        "'args.sonnet-input-len' must be greater than or equal to "
+        "'args.prefix-input-len'.")
 
     # Load the dataset.
     with open(dataset_path) as f:
48 changes: 48 additions & 0 deletions benchmarks/disagg_benchmarks/analyze_benchmark_results.py
@@ -0,0 +1,48 @@

import argparse
import json
import yaml
import os
from pathlib import Path

def load(path):

    with open(str(path), 'r') as f:
        return json.loads(f.read())

def main(args):

    results = Path(args.results_folder)

    chunk = load(results / "chunked_prefill_tp4.json")
    prefill = load(results / "disagg_prefill_tp4.json")
    decode = load(results / "disagg_decode_tp4.json")

    ttft_ratio = chunk["mean_ttft_ms"] / prefill["mean_ttft_ms"]
    itl_ratio = chunk["mean_itl_ms"] / decode["mean_itl_ms"]
    prefill_decode_ratio = prefill["mean_ttft_ms"] / (decode["mean_itl_ms"] * args.output_len)

    with open(results / args.output_file, 'a') as f:
        f.write(yaml.dump([{
            'qps': args.qps,
            'output_len': args.output_len,
            'prefill_decode_ratio': prefill_decode_ratio,
            'ttft_ratio': ttft_ratio,
            'itl_ratio': itl_ratio,
            "chunk_ttft": chunk["mean_ttft_ms"],
            "chunk_itl": chunk["mean_itl_ms"],
            "disagg_ttft": prefill["mean_ttft_ms"],
            "disagg_itl": decode["mean_itl_ms"]
        }]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Analyze benchmark results")
    parser.add_argument("--results-folder", required=True, help="Path to the results folder")
    parser.add_argument("--output-len", type=int, required=True, help="Target output length")
    parser.add_argument("--qps", type=int, required=True, help="Target QPS")
    parser.add_argument("--output-file", type=str, default="chunk_vs_disagg.yaml")

    args = parser.parse_args()
    main(args)
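
For reference, a typical invocation of this analyzer (assuming ./results already contains the three JSON files named above) would be: python3 analyze_benchmark_results.py --results-folder ./results --qps 2 --output-len 150. Each run appends one YAML record with the TTFT and ITL ratios to chunk_vs_disagg.yaml (or the file passed via --output-file) inside the results folder.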

148 changes: 148 additions & 0 deletions benchmarks/disagg_benchmarks/disagg_overhead_benchmark.sh
@@ -0,0 +1,148 @@
#!/bin/bash

# Requirement: 8x H100 GPUs.


# Purpose: measure the overhead of the disaggregated prefill implementation
# (KV cache transfer plus bookkeeping).
# Model: meta-llama/Meta-Llama-3.1-70B-Instruct
# Query: 2048 input tokens, 1 output token, QPS 1, 50 requests (defaults below)
# Resource: 8x H100
# Method: launch one prefill instance (tp=4) and one decode instance (tp=4),
# then send the same requests first to the prefill instance and afterwards to
# the decode instance. Because the decode instance has already received the
# KV caches from the prefill instance, the TTFT of the second run approximates
# the overhead of the disaggregated prefill implementation.

set -ex

kill_gpu_processes() {
# kill all processes on GPU.
pkill pt_main_thread
sleep 10

# remove vllm config file
rm -rf ~/.config/vllm

# Print the GPU memory usage
# so that we know if all GPU processes are killed.
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
# The memory usage should be 0 MB.
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}


benchmark() {

export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
export VLLM_PORT=12345

# compare chunked prefill with disaggregated prefill

results_folder="./results"
model="meta-llama/Meta-Llama-3.1-70B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=50
qps=$1
prefix_len=50
input_len=2048
output_len=$2

# large model
VLLM_RPC_PORT=5570 VLLM_DISAGG_PREFILL_ROLE=prefill CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
-tp 4 \
--max-model-len 30000 \
--gpu-memory-utilization 0.8 &
VLLM_RPC_PORT=5580 VLLM_DISAGG_PREFILL_ROLE=decode CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
-tp 4 \
--max-model-len 30000 \
--gpu-memory-utilization 0.8 &

wait_for_server 8100
wait_for_server 8200

# let the prefill instance finish prefill
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len $output_len \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_2xtp4.json \
--request-rate $qps


# send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl.
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len $output_len \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_2xtp4.json \
--request-rate $qps
kill_gpu_processes

}


main() {

(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)

pip install quart httpx

cd "$(dirname "$0")"

cd ..
# create sonnet_4x.txt
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks

rm -rf results
mkdir results

default_qps=1
default_output_len=1
benchmark $default_qps $default_output_len

}


main "$@"
172 changes: 172 additions & 0 deletions benchmarks/disagg_benchmarks/disagg_performance_benchmark.sh
@@ -0,0 +1,172 @@
#!/bin/bash

# Requirement: 8x H100 GPUs.


# Model: meta-llama/Meta-Llama-3.1-70B-Instruct
# Query: 2048 input tokens, 150 output tokens, QPS 2/4/6/8, 400 requests
# Resource: 8x H100
# Approaches compared:
# 1. Chunked prefill: 2 vllm instances with tp=4, load-balanced by a
#    round-robin proxy
# 2. Disaggregated prefill: 1 prefill instance (tp=4) and 1 decode instance
#    (tp=4), served through disagg_prefill_proxy_server.py
# Both setups are benchmarked through a proxy listening on port 8000.

set -ex

kill_gpu_processes() {
# kill all processes on GPU.
pkill -f pt_main_thread
pkill -f python3
pkill -f round_robin_proxy.sh
ps -e | grep pt_main_thread | awk '{print $1}' | xargs kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1
}

wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}


launch_chunked_prefill() {
model="meta-llama/Meta-Llama-3.1-70B-Instruct"
# chunked prefill
VLLM_RPC_PORT=5570 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
-tp 4 \
--max-model-len 30000 \
--disable-log-stats \
--disable-log-requests \
--enable-chunked-prefill \
--gpu-memory-utilization 0.8 &
VLLM_RPC_PORT=5580 CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
-tp 4 \
--max-model-len 30000 \
--disable-log-stats \
--disable-log-requests \
--enable-chunked-prefill \
--gpu-memory-utilization 0.8 &
wait_for_server 8100
wait_for_server 8200
bash round_robin_proxy.sh &
sleep 1
}


launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-70B-Instruct"
# disagg prefill
VLLM_PORT=12345 VLLM_RPC_PORT=5570 VLLM_DISAGG_PREFILL_ROLE=prefill CUDA_VISIBLE_DEVICES=0,1,2,3 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
-tp 4 \
--max-model-len 30000 \
--disable-log-stats \
--disable-log-requests \
--gpu-memory-utilization 0.8 &
VLLM_PORT=12345 VLLM_RPC_PORT=5580 VLLM_DISAGG_PREFILL_ROLE=decode CUDA_VISIBLE_DEVICES=4,5,6,7 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
-tp 4 \
--max-model-len 30000 \
--disable-log-stats \
--disable-log-requests \
--gpu-memory-utilization 0.8 &
wait_for_server 8100
wait_for_server 8200
python3 disagg_prefill_proxy_server.py &
sleep 1
}


benchmark() {
results_folder="./results"
model="meta-llama/Meta-Llama-3.1-70B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=400
qps=$1
prefix_len=50
input_len=2048
output_len=$2
tag=$3

python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len $output_len \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename $tag-qps-$qps.json \
--request-rate $qps

sleep 2

}


main() {

(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)

pip install quart httpx

cd "$(dirname "$0")"

cd ..
# create sonnet_4x.txt so that we can sample 2048 tokens for input
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks

rm -rf results
mkdir results

default_qps=10
default_output_len=150

export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

launch_chunked_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len chunked_prefill
done
kill_gpu_processes

launch_disagg_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len disagg_prefill
done
kill_gpu_processes

}


main "$@"