phuongdo/DistServe

DISTSERVE

Dockerized Ray Serve (Prefill + Decode) — 2×GPU with Benchmark

This demo runs two containers on a single VM with 2× NVIDIA GPUs and comes with a simple benchmark suite.

  • prefill container (GPU0): Ray head + Serve controller + PrefillService
  • decode container (GPU1): Ray worker + DecodeService
  • benchmarks: latency, tokens/sec, GPU utilization logs

🔎 How It Works

1. Prefill

  • Processes the entire input prompt sequence.
  • Every layer of the model must compute over all input tokens to build the attention cache (key-value cache).
  • Prefill is heavy on GPU FLOPs and memory bandwidth because of large matrix multiplications over
    batch × sequence_length × hidden_size.
  • The cost grows significantly with long contexts (e.g., >8k tokens).

👉 Prefill is usually the most GPU-intensive stage: projection compute scales linearly with prompt length, and attention scales quadratically.
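As a rough sketch of that scaling (an illustrative cost model, not the repo's benchmark math — the constants and the `prefill_flops` helper are assumptions):

```python
def prefill_flops(batch: int, seq_len: int, hidden: int, layers: int) -> int:
    """Rough lower bound on prefill matmul FLOPs for a transformer.

    Per layer, the QKV/output/MLP projections cost on the order of
    batch * seq_len * hidden^2, while attention adds a
    batch * seq_len^2 * hidden term that dominates at long contexts.
    """
    proj = 8 * batch * seq_len * hidden * hidden   # projection matmuls (approx.)
    attn = 4 * batch * seq_len * seq_len * hidden  # attention score + value matmuls
    return layers * (proj + attn)

# Doubling the prompt more than doubles the cost once the seq_len^2 term matters:
short = prefill_flops(batch=1, seq_len=4096, hidden=768, layers=12)
long = prefill_flops(batch=1, seq_len=8192, hidden=768, layers=12)
```

The projection term alone would make the 8k prompt exactly twice the cost of the 4k one; the quadratic attention term is what pushes it beyond 2×.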

2. Decode

  • Generates tokens one by one.
  • Uses the attention cache from prefill, so each step only computes with the new token (no need to recompute the full sequence).
  • Per-token cost is much lower than prefill, but decode must run sequentially, so latency adds up for long outputs (e.g., thousands of tokens).

👉 Decode is mainly latency-bound; its per-token GPU load is much lighter than prefill's.
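A toy sketch of that loop (pure-Python stand-in for the real decode path; the cache contents and the modulo "sampling" are placeholders):

```python
def decode(kv_cache: list, n_new_tokens: int) -> list:
    """Generate tokens one at a time, attending over the cached keys/values.

    Each step only does work for the single new token (attention over the
    cache) instead of recomputing the whole prompt, but the steps cannot be
    parallelized: token t+1 depends on token t, so latency adds up.
    """
    out = []
    for _ in range(n_new_tokens):
        score = sum(kv_cache) % 50257  # stand-in for attention + sampling
        kv_cache.append(score)         # the cache grows by one entry per step
        out.append(score)
    return out

cache = list(range(256))  # pretend prefill already cached a 256-token prompt
tokens = decode(cache, n_new_tokens=8)
```

Note that the cache produced by prefill is the only state decode needs, which is what makes running the two stages on separate GPUs possible.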

⚡ Quick Start

docker compose build                    # build both container images
MODEL_NAME=gpt2 docker compose up -d    # start prefill (GPU0) + decode (GPU1) in the background
./test.sh                               # sanity-check the deployment
./bench.sh 10 32                        # run a quick benchmark
cat logs/bench_summary.txt              # inspect the results
docker compose down -v                  # tear down containers and volumes

🧪 Example Benchmarks

5 minutes, heavy prompts, small decode, high concurrency:

./bench.sh -d 300 -c 64 -p 256 -t 8    # 300 s, 64 concurrent clients, 256-token prompts, 8 decode tokens

📊 Metrics

The benchmark suite collects:

  • Latency (p50, p95, p99)
  • Throughput (tokens/sec)
  • GPU utilization (per device)
  • Run metadata (model, config, seed)
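For reference, the latency percentiles are standard empirical quantiles. A generic nearest-rank sketch (the repo's bench script may compute them differently, and the sample latencies below are made up):

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [12.0, 15.0, 11.0, 40.0, 13.0, 14.0, 90.0, 12.5, 13.5, 16.0]
p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # tail latency
p99 = percentile(latencies_ms, 99)  # worst-case tail
```

Tail percentiles (p95/p99) matter most here, since slow outliers are exactly what disaggregation tries to avoid by keeping long prefills off the decode GPU.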

📌 Notes

  • Prefill (GPU0) = compute-intensive
  • Decode (GPU1) = latency-sensitive
  • Two containers, two GPUs → easy to extend or scale horizontally
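Horizontal scaling of the decode side can be sketched as a simple router fanning requests across workers (illustrative only — worker names and the round-robin policy are assumptions, not the repo's routing):

```python
import itertools

class RoundRobinRouter:
    """Toy request router: fan decode requests out across decode workers.

    Adding a third GPU means adding one more worker name here; the
    prefill side does not need to change.
    """
    def __init__(self, workers: list):
        self._cycle = itertools.cycle(workers)

    def route(self) -> str:
        """Return the worker that should handle the next request."""
        return next(self._cycle)

router = RoundRobinRouter(["decode-gpu1", "decode-gpu2"])
picks = [router.route() for _ in range(4)]
```

A production setup would route on load or KV-cache locality rather than round-robin, but the shape of the extension is the same.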

About

DistServe: Disaggregating Prefill and Decoding
