Commit 84ff352

Title: Boost NWOR defer pipeline and mask commit path (and other performance fixes to reduce overhead)

* cache slot mapping per window and hoist segment computation
* add mask-aware commit fast path with contiguous narrow reuse
* harden mask validation and fallback metrics
* expand deferred writer test coverage
* and more!

2 parents 50ce473 + dcdc56b commit 84ff352

15 files changed: +4289 −277 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -218,3 +218,4 @@ csrc/moe/marlin_moe_wna16/kernel_*

 # Ignore ep_kernels_workspace folder
 ep_kernels_workspace/
+sweeps/
```

PROFILING_GUIDE.md

Lines changed: 276 additions & 0 deletions
# NWOR + SCV Profiling Guide

## Overview

This guide explains what NWOR and SCV optimize, which metrics to measure, and which tools to use.

---

## NWOR (Non-blocking Write-Or-Read) Stage Mode

### What NWOR Optimizes

**Problem**: Speculative decoding writes draft tokens to the KV cache, then overwrites them when they are rejected (wasted DRAM bandwidth).

**Solution**: Stage draft tokens in temporary buffers and write only the accepted tokens to the KV cache.
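The stage-then-commit flow can be sketched in a few lines. This is an illustrative model only, not vLLM's actual deferred writer; the class and method names are hypothetical:

```python
class StagedKVWriter:
    """Minimal sketch: buffer draft-token writes, flush only accepted ones."""

    def __init__(self):
        self.kv_cache = {}   # slot -> token (stands in for the real KV cache)
        self.staged = []     # pending (slot, token) writes

    def stage(self, slot, token):
        # Draft tokens land in a temporary buffer, not the KV cache.
        self.staged.append((slot, token))

    def commit(self, accept_mask):
        # Only accepted positions are written through to the cache;
        # rejected ones are dropped, saving DRAM write bandwidth.
        for (slot, token), ok in zip(self.staged, accept_mask):
            if ok:
                self.kv_cache[slot] = token
        saved = accept_mask.count(False)
        self.staged.clear()
        return saved
```

A rejected token never touches the cache, which is where the DRAM write savings measured below come from.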
### What NWOR Does NOT Optimize

- ❌ Latency (adds 2-3% overhead from staging logic)
- ❌ Computation (same model forward passes)
- ❌ CPU time (minimal impact)

### What NWOR DOES Optimize

- ✅ **DRAM write bandwidth** (primary benefit)
- ✅ **Memory write pressure** (reduces cache contention)
- ✅ **KV cache write traffic** (only accepted tokens are written)
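The write traffic at stake can be estimated from model shape: each rejected draft token avoids one key and one value write per layer. A back-of-the-envelope helper (illustrative only; the function name and parameters are hypothetical):

```python
def kv_write_savings_bytes(rejected_tokens, num_layers, num_kv_heads, head_dim,
                           bytes_per_elem=2):
    """Bytes of KV-cache writes avoided by not committing rejected draft tokens.

    Each token writes one K and one V vector per layer, so the per-token
    footprint is 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return rejected_tokens * per_token
```

For example, a Llama-8B-like shape (32 layers, 8 KV heads, head dim 128, fp16) works out to 131,072 bytes avoided per rejected token.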
### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **`dram__bytes_write.sum`** | NCU | Total DRAM writes | ↓ 10-15% (matches rejection rate) |
| **`dram__bytes_read.sum`** | NCU | Total DRAM reads | No change (same reads) |
| **`lts__t_sectors_op_write.sum`** | NCU | L2 cache write traffic | ↓ 10-15% (tracks DRAM writes) |
| **`dram__throughput.avg.pct_of_peak`** | NCU | Memory bandwidth utilization | ↓ if memory-bound |
| **Latency (E2E)** | Benchmark | Total request latency | ↑ 2-3% (staging overhead) |
| **Tokens Staged** | vLLM metrics | Draft tokens staged | Should equal draft tokens |
| **Tokens Committed** | vLLM metrics | Staged tokens written | Should equal accepted tokens |
| **Writes Saved %** | vLLM metrics | (staged - committed) / staged | Should be ~100% |
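The "Writes Saved %" row is a simple ratio of the two counters above; a small helper makes the definition explicit (the function name is hypothetical):

```python
def writes_saved_pct(tokens_staged, tokens_committed):
    """(staged - committed) / staged, as a percentage; 0 when nothing staged."""
    if tokens_staged == 0:
        return 0.0
    return 100.0 * (tokens_staged - tokens_committed) / tokens_staged
```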
### When NWOR Shows Benefits

- ✅ **Large batches** (32-128 requests) → more rejected writes
- ✅ **High memory pressure** → bandwidth bottleneck visible
- ✅ **Long sequences** → larger KV cache footprint
- ✅ **Multi-GPU** → inter-GPU bandwidth constrained
- ✅ **Sustained workload** → cumulative bandwidth savings

- ❌ **Small batches** (8 requests) → low memory pressure, overhead dominates
- ❌ **Short runs** → overhead visible, benefits don't accumulate
### How to Profile NWOR

```bash
# 1. Run the NCU bandwidth test
./run_ncu_bandwidth_test.sh

# 2. Check key metrics
python3 << EOF
import json

with open('sweeps/ncu_analysis/small_baseline_t0.7.json') as f:
    baseline = json.load(f)
with open('sweeps/ncu_analysis/small_nwor_t0.7.json') as f:
    nwor = json.load(f)

base_writes = baseline['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']
nwor_writes = nwor['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']

reduction_pct = ((base_writes - nwor_writes) / base_writes) * 100
print(f"DRAM Write Reduction: {reduction_pct:.2f}%")
print(f"Baseline: {base_writes/1e9:.4f} GB")
print(f"NWOR:     {nwor_writes/1e9:.4f} GB")
print(f"Saved:    {(base_writes - nwor_writes)/1e9:.4f} GB")
EOF
```
### Expected NCU Output

```
Baseline (NWOR off):
  DRAM Writes: 1,250,000,000 bytes (1.25 GB)
  DRAM Reads:  5,000,000,000 bytes (5.00 GB)
  L2 Writes:   45,200,000 sectors
  BW Util:     12.50%

NWOR Stage:
  DRAM Writes: 1,125,000,000 bytes (1.13 GB) ← 10% reduction!
  DRAM Reads:  5,000,000,000 bytes (5.00 GB) ← Same
  L2 Writes:   40,700,000 sectors ← 10% reduction
  BW Util:     11.80% ← Lower

Delta: -125 MB (-10%) in DRAM writes
```
---

## SCV (Speculative Comparison Vectorized) Graph Mode

### What SCV Optimizes

**Problem**: Mask computation for speculative verification uses a host-side Python loop (slow, sequential).

**Solution**: A vectorized GPU kernel plus CUDA graph capture (fast, parallel, near-zero dispatch overhead).

### What SCV Does NOT Optimize

- ❌ DRAM bandwidth (same memory operations)
- ❌ KV cache writes (NWOR's job)
- ❌ Model computation (same forward passes)

### What SCV DOES Optimize

- ✅ **Host CPU overhead** (Python loop → GPU kernel)
- ✅ **Kernel launch overhead** (N launches → 1 launch, or 0 with graph replay)
- ✅ **CPU-GPU sync points** (per-iteration syncs → single sync)
- ✅ **Parallelism** (sequential requests → parallel)
- ✅ **Dispatch overhead** (kernel launch ~5µs → graph replay <1µs)
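The loop-vs-vectorized distinction can be illustrated on the CPU with NumPy. This is a sketch of prefix-acceptance masking (accept draft tokens up to the first mismatch), not the actual SCV kernel; both function names are hypothetical:

```python
import numpy as np

def acceptance_mask_loop(draft, target):
    # Host-side loop: sequential, one comparison at a time.
    mask = []
    accept = True
    for d, t in zip(draft, target):
        accept = accept and (d == t)
        mask.append(accept)
    return mask

def acceptance_mask_vectorized(draft, target):
    # One vectorized pass: a position is accepted only if every
    # earlier draft token also matched (cumulative product of equality).
    eq = np.asarray(draft) == np.asarray(target)
    return np.cumprod(eq).astype(bool).tolist()
```

The GPU version applies the same idea across all requests at once, which is what replaces the per-request kernel launches counted below.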
### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **Host CPU time** | Nsight Systems | Python loop overhead | ↓ 10-100µs (baseline has loop) |
| **Kernel launch count** | Nsight Systems / NCU | Number of CUDA kernel launches | N launches → 1 (or 0 with graph) |
| **CUDA API overhead** | Nsight Systems | cudaLaunchKernel time | ↓ 90% with graph capture |
| **GPU kernel time** | Nsight Systems / NCU | Actual computation time | Similar (same work, better parallelism) |
| **NVTX range** | Nsight Systems | "scv_compute_mask" marker | Visible in timeline |
| **Latency (E2E)** | Benchmark | Total request latency | ↓ 0-5µs or neutral |
| **`gpu__time_duration.sum`** | NCU | Total GPU time in kernel | Similar baseline vs SCV |
| **`sm__warps_launched.sum`** | NCU | Parallelism (warps) | Higher with SCV (parallel) |
### How to Profile SCV

```bash
# 1. Run the Nsight Systems analysis
./run_scv_benefit_analysis.sh

# 2. Open reports in the GUI
nsight-sys sweeps/scv_benefit_analysis/baseline_off_small_nsys.nsys-rep
nsight-sys sweeps/scv_benefit_analysis/scv_graph_small_nsys.nsys-rep

# 3. Compare timelines:
#    - CPU timeline: Python function calls (baseline) vs a single kernel launch (SCV)
#    - GPU timeline: count kernel launches
#    - CUDA API: count cudaLaunchKernel calls
#    - NVTX: find "scv_compute_mask" markers
```
### Expected Nsight Systems Output

**Baseline (SCV off)**:
```
CPU Timeline:
├─ Python: _compute_acceptance_mask (50µs)
│  └─ for loop over requests
│     ├─ cudaLaunchKernel (5µs) ← Multiple launches
│     ├─ cudaLaunchKernel (5µs)
│     └─ cudaLaunchKernel (5µs)
└─ cudaDeviceSynchronize (10µs)

GPU Timeline:
├─ Kernel: compare_tokens (2µs)
├─ Kernel: compare_tokens (2µs)
└─ Kernel: compare_tokens (2µs)

Total: ~80µs (50µs host + 30µs GPU/sync)
```

**SCV Graph Mode**:
```
CPU Timeline:
├─ Python: _scv_vectorized_mask (5µs) ← Single call
│  └─ cudaGraphLaunch (<1µs) ← Graph replay!
└─ cudaDeviceSynchronize (10µs)

GPU Timeline:
└─ Kernel: _scv_compute_mask_inplace (6µs) ← Single kernel

NVTX:
└─ [scv_compute_mask] (20µs total)

Total: ~20µs (5µs host + 6µs kernel + 10µs sync)
```
**Savings**: 80µs → 20µs = **60µs reduction (~75%)**

### SCV Graph Capture Benefit

**Without graph** (SCV vectorized mode):
- Kernel launch overhead: ~5µs per call
- Host dispatch: ~2µs
- Total overhead: ~7µs

**With graph** (SCV graph mode):
- Graph replay: <1µs
- Host dispatch: ~0.5µs
- Total overhead: ~1.5µs

**Graph benefit**: ~5.5µs saved per mask computation

At 100 iterations:
- Without graph: 7µs × 100 = 700µs
- With graph: 1.5µs × 100 = 150µs
- **Savings: 550µs (0.55ms)**
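The arithmetic above is just per-call overhead times iteration count; as a sanity check, using this guide's own estimates:

```python
def total_overhead_us(per_call_us, iterations):
    # Dispatch overhead scales linearly with the number of mask computations.
    return per_call_us * iterations

# Per-call figures from the estimates above (approximate, not measured here):
without_graph = total_overhead_us(7.0, 100)   # ~5µs launch + ~2µs host dispatch
with_graph = total_overhead_us(1.5, 100)      # <1µs replay + ~0.5µs host dispatch
savings = without_graph - with_graph          # 550µs over 100 iterations
```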
---

## Combined Analysis

### Trade-offs Summary

| Mode | Latency Impact | Bandwidth Impact | When to Use |
|------|----------------|------------------|-------------|
| **NWOR off, SCV off** | Baseline | Baseline | Never (baseline only) |
| **NWOR stage, SCV off** | +2-3% | -10-15% writes | High memory pressure |
| **NWOR off, SCV graph** | -0.5% or neutral | None | Always (no downside) |
| **NWOR stage, SCV graph** | +2-3% | -10-15% writes | High memory pressure |
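The table reduces to a simple rule: SCV graph mode is always on, and NWOR staging is gated on memory pressure. A sketch of that policy (a hypothetical config helper, not a real vLLM API):

```python
def choose_modes(high_memory_pressure: bool) -> dict:
    """SCV graph has no downside; NWOR staging trades 2-3% latency
    for 10-15% fewer DRAM writes, so enable it only under pressure."""
    return {
        "scv": "graph",
        "nwor": "stage" if high_memory_pressure else "off",
    }
```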
### Recommendations

1. **SCV Graph Mode**: ✅ **Always enable**
   - Negligible overhead (<2%)
   - Some scenarios show improvement
   - No downside, pure benefit

2. **NWOR Stage Mode**: ⚠️ **Enable for high-throughput workloads**
   - Costs 2-3% latency
   - Saves 10-15% of DRAM writes
   - Net positive under memory pressure (large batches, multi-GPU)
   - Make it configurable and document the trade-off

3. **Combined Mode**: ⚠️ **Use-case dependent**
   - SCV overhead is negligible; NWOR overhead dominates
   - Best for sustained high-throughput workloads
   - Profile your specific workload first

---
## Quick Reference Commands

### Measure NWOR Bandwidth Savings

```bash
./run_ncu_bandwidth_test.sh
# Check: sweeps/ncu_analysis/*_stats.txt
# Look for: dram__bytes_write.sum reduction
```

### Measure SCV Host Overhead Reduction

```bash
./run_scv_benefit_analysis.sh
# Open: nsight-sys sweeps/scv_benefit_analysis/*_nsys.nsys-rep
# Compare: CPU timeline, kernel launch counts
```

### Quick Latency-Only Test

```bash
./run_benchmark_sweep.sh
# Check: sweeps/*.json for latency_avg_s
```

---
## Interpretation

### NWOR is Working If:

- ✅ `nwor_writes_saved_pct` = 100%
- ✅ `dram__bytes_write.sum` reduced by ~10-15%
- ✅ `lts__t_sectors_op_write.sum` reduced proportionally
- ⚠️ Latency increased by 2-3% (expected overhead)

### SCV is Working If:

- ✅ Latency neutral or slightly improved
- ✅ Nsight Systems shows fewer kernel launches
- ✅ Nsight Systems shows reduced host CPU time
- ✅ NVTX markers visible for "scv_compute_mask"
- ✅ Graph replay <1µs (vs ~5µs kernel launch)

### Both are Working If:

- ✅ NWOR metrics correct (above)
- ✅ SCV metrics correct (above)
- ⚠️ Combined overhead ≈ NWOR overhead (SCV adds minimal)
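For automated regression runs, the NWOR checklist can be folded into a single check. The thresholds below are illustrative, taken from the expectations above; the function name is hypothetical:

```python
def nwor_looks_healthy(writes_saved_pct, dram_write_reduction_pct, latency_delta_pct):
    # Saved% near 100, write reduction within the expected band,
    # and no more than the expected small latency cost.
    return (writes_saved_pct >= 99.0
            and 5.0 <= dram_write_reduction_pct <= 25.0
            and 0.0 <= latency_delta_pct <= 5.0)
```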
