Dockerized Ray Serve (Prefill + Decode) — 2×GPU with Benchmark
This demo runs two containers on a single VM with 2× NVIDIA GPUs and comes with a simple benchmark suite.
- prefill container (GPU0): Ray head + Serve controller + PrefillService
- decode container (GPU1): Ray worker + DecodeService
- benchmarks: latency, tokens/sec, GPU utilization logs
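Under the hood these are two Ray Serve deployments, each pinned to one GPU. A minimal sketch, assuming the repo's `PrefillService`/`DecodeService` names; the method signatures and payloads are illustrative, not the repo's exact code:

```python
# Sketch of the two Serve deployments, one GPU each (illustrative).
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1})  # the prefill container only exposes GPU0
class PrefillService:
    async def __call__(self, prompt: str) -> dict:
        # Run one full forward pass over the prompt and return the KV-cache state.
        ...


@serve.deployment(ray_actor_options={"num_gpus": 1})  # the decode container only exposes GPU1
class DecodeService:
    async def __call__(self, prefill_state: dict, max_new_tokens: int) -> str:
        # Generate tokens one at a time from the cached prefill state.
        ...
```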
Prefill stage (GPU0):
- Processes the entire input prompt sequence.
- Every layer of the model must compute over all input tokens to build the attention (key-value) cache.
- Prefill consumes very high GPU FLOPs and memory bandwidth due to large matrix multiplications over batch × sequence_length × hidden_size activations.
- The cost grows significantly with long contexts (e.g., >8k tokens).
👉 Prefill is usually the most GPU-intensive stage: compute grows with prompt length (roughly linearly in the MLP layers and quadratically in attention).
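Concretely, prefill is a single forward pass that materializes the KV cache. A minimal sketch with Hugging Face `transformers` and the demo's default `gpt2` model (illustrative, not the repo's serving code):

```python
# Prefill sketch: one forward pass over the whole prompt builds the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Explain the difference between prefill and decode."
input_ids = tok(prompt, return_tensors="pt").input_ids   # shape: [1, prompt_len]

with torch.no_grad():
    out = model(input_ids, use_cache=True)                # attends over every prompt token

past_key_values = out.past_key_values                     # KV cache handed to the decode stage
next_token_logits = out.logits[:, -1, :]                  # logits for the first generated token
```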
Decode stage (GPU1):
- Generates tokens one by one.
- Uses the attention (KV) cache built during prefill, so each step only processes the newly generated token instead of recomputing the full sequence.
- Per-token cost is much lower than in prefill, but decode must run sequentially, so latency adds up for long outputs (e.g., thousands of tokens).
👉 Decode is mainly latency-bound; the per-token GPU load is much lighter than in prefill.
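Continuing the prefill sketch above, decode reuses `past_key_values` and feeds a single token per step, so each step is cheap but the steps are strictly sequential:

```python
# Decode sketch: one token per step, reusing the KV cache produced by prefill.
next_id = next_token_logits.argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
generated = [next_id]

with torch.no_grad():
    for _ in range(31):                                    # e.g. 32 new tokens in total
        out = model(next_id, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values              # cache grows by one position
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

completion = tok.decode(torch.cat(generated, dim=-1)[0])
```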
Quick start:
docker compose build
MODEL_NAME=gpt2 docker compose up -d
./test.sh
./bench.sh 10 32
cat logs/bench_summary.txt
docker compose down -v

For a longer run (5 minutes, heavy prompts, small decode, high concurrency):
./bench.sh -d 300 -c 64 -p 256 -t 8

The benchmark suite collects:
- Latency (p50, p95, p99)
- Throughput (tokens/sec)
- GPU utilization (per device)
- Run metadata (model, config, seed)
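A rough sketch of how such numbers can be aggregated from per-request measurements (illustrative; bench.sh's internals may differ, and the sample values below are made up):

```python
# Sketch: turning per-request measurements into the reported metrics (illustrative only).
import numpy as np

# Hypothetical per-request records: (latency_seconds, tokens_generated).
results = [(0.42, 32), (0.51, 32), (0.38, 32), (0.47, 32)]
wall_clock_s = 10.0                                   # assumed total benchmark duration

latencies = np.array([latency for latency, _ in results])
total_tokens = sum(tokens for _, tokens in results)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
tokens_per_sec = total_tokens / wall_clock_s
# GPU utilization is usually sampled separately (e.g. polling nvidia-smi in a side loop).

print(f"p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s  throughput={tokens_per_sec:.1f} tok/s")
```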
In short:
- Prefill (GPU0) = compute-intensive
- Decode (GPU1) = latency-sensitive
- Two containers, two GPUs → easy to extend or scale horizontally (see the sketch below)
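Because each stage is its own Serve deployment, horizontal scaling is mostly a configuration change. An illustrative sketch, assuming extra decode GPUs/containers have joined the Ray cluster (the demo itself runs one replica per stage):

```python
# Illustrative: run two decode replicas once a second decode GPU is available.
from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class DecodeService:
    async def __call__(self, prefill_state: dict, max_new_tokens: int) -> str:
        ...
```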
