Dockerized Ray Serve (Prefill + Decode) — 2×GPU with Benchmark
This demo runs two containers on a single VM with 2× NVIDIA GPUs and comes with a simple benchmark suite.
- prefill container (GPU0): Ray head + Serve controller + PrefillService
- decode container (GPU1): Ray worker + DecodeService
- benchmarks: latency, tokens/sec, GPU utilization logs
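Under the hood these are two Ray Serve deployments, each pinned to one GPU. A minimal sketch, assuming the repo's `PrefillService`/`DecodeService` names; the method signatures and payloads are illustrative, not the repo's exact code:

```python
# Sketch of the two Serve deployments, one GPU each (illustrative).
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1})  # the prefill container only exposes GPU0
class PrefillService:
    async def __call__(self, prompt: str) -> dict:
        # Run one full forward pass over the prompt and return the KV-cache state.
        ...


@serve.deployment(ray_actor_options={"num_gpus": 1})  # the decode container only exposes GPU1
class DecodeService:
    async def __call__(self, prefill_state: dict, max_new_tokens: int) -> str:
        # Generate tokens one at a time from the cached prefill state.
        ...
```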
Prefill stage (GPU0):
- Processes the entire input prompt sequence.
- Every layer of the model must compute over all input tokens to build the attention (key-value) cache.
- Prefill consumes very high GPU FLOPs and memory bandwidth due to large matrix multiplications over batch × sequence_length × hidden_size activations.
- The cost grows significantly with long contexts (e.g., >8k tokens).
👉 Prefill is usually the most GPU-intensive stage: compute grows with prompt length (roughly linearly in the MLP layers and quadratically in attention).
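Concretely, prefill is a single forward pass that materializes the KV cache. A minimal sketch with Hugging Face `transformers` and the demo's default `gpt2` model (illustrative, not the repo's serving code):

```python
# Prefill sketch: one forward pass over the whole prompt builds the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Explain the difference between prefill and decode."
input_ids = tok(prompt, return_tensors="pt").input_ids   # shape: [1, prompt_len]

with torch.no_grad():
    out = model(input_ids, use_cache=True)                # attends over every prompt token

past_key_values = out.past_key_values                     # KV cache handed to the decode stage
next_token_logits = out.logits[:, -1, :]                  # logits for the first generated token
```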
Decode stage (GPU1):
- Generates tokens one by one.
- Uses the attention (KV) cache built during prefill, so each step only processes the newly generated token instead of recomputing the full sequence.
- Per-token cost is much lower than in prefill, but decode must run sequentially, so latency adds up for long outputs (e.g., thousands of tokens).
👉 Decode is mainly latency-bound; the per-token GPU load is much lighter than in prefill.
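Continuing the prefill sketch above, decode reuses `past_key_values` and feeds a single token per step, so each step is cheap but the steps are strictly sequential:

```python
# Decode sketch: one token per step, reusing the KV cache produced by prefill.
next_id = next_token_logits.argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
generated = [next_id]

with torch.no_grad():
    for _ in range(31):                                    # e.g. 32 new tokens in total
        out = model(next_id, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values              # cache grows by one position
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

completion = tok.decode(torch.cat(generated, dim=-1)[0])
```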
Quick start:
docker compose build
MODEL_NAME=gpt2 docker compose up -d
./test.sh
./bench.sh 10 32
cat logs/bench_summary.txt
docker compose down -v

For a longer run (5 minutes, heavy prompts, small decode, high concurrency):
./bench.sh -d 300 -c 64 -p 256 -t 8

The benchmark suite collects:
- Latency (p50, p95, p99)
- Throughput (tokens/sec)
- GPU utilization (per device)
- Run metadata (model, config, seed)
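A rough sketch of how such numbers can be aggregated from per-request measurements (illustrative; bench.sh's internals may differ, and the sample values below are made up):

```python
# Sketch: turning per-request measurements into the reported metrics (illustrative only).
import numpy as np

# Hypothetical per-request records: (latency_seconds, tokens_generated).
results = [(0.42, 32), (0.51, 32), (0.38, 32), (0.47, 32)]
wall_clock_s = 10.0                                   # assumed total benchmark duration

latencies = np.array([latency for latency, _ in results])
total_tokens = sum(tokens for _, tokens in results)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
tokens_per_sec = total_tokens / wall_clock_s
# GPU utilization is usually sampled separately (e.g. polling nvidia-smi in a side loop).

print(f"p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s  throughput={tokens_per_sec:.1f} tok/s")
```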
In short:
- Prefill (GPU0) = compute-intensive
- Decode (GPU1) = latency-sensitive
- Two containers, two GPUs → easy to extend or scale horizontally (see the sketch below)
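Because each stage is its own Serve deployment, horizontal scaling is mostly a configuration change. An illustrative sketch, assuming extra decode GPUs/containers have joined the Ray cluster (the demo itself runs one replica per stage):

```python
# Illustrative: run two decode replicas once a second decode GPU is available.
from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class DecodeService:
    async def __call__(self, prefill_state: dict, max_new_tokens: int) -> str:
        ...
```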
