# NWOR + SCV Profiling Guide

## Overview

This guide explains what NWOR and SCV optimize, which metrics to measure, and which tools to use.

---

## NWOR (Non-blocking Write-Or-Read) Stage Mode

### What NWOR Optimizes
**Problem**: Speculative decoding writes draft tokens to the KV cache, then overwrites them when they are rejected (wasted DRAM bandwidth).

**Solution**: Stage draft tokens in temporary buffers and write only the accepted tokens to the KV cache.

### What NWOR Does NOT Optimize
- ❌ Latency (adds 2-3% overhead from staging logic)
- ❌ Computation (same model forward passes)
- ❌ CPU time (minimal impact)

### What NWOR DOES Optimize
- ✅ **DRAM write bandwidth** (primary benefit)
- ✅ **Memory write pressure** (reduces cache contention)
- ✅ **KV cache write traffic** (only accepted tokens are written)

### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **`dram__bytes_write.sum`** | NCU | Total DRAM writes | ↓ 10-15% (matches rejection rate) |
| **`dram__bytes_read.sum`** | NCU | Total DRAM reads | No change (same reads) |
| **`lts__t_sectors_op_write.sum`** | NCU | L2 cache write traffic | ↓ 10-15% (tracks DRAM writes) |
| **`dram__throughput.avg.pct_of_peak`** | NCU | Memory bandwidth utilization | ↓ if memory-bound |
| **Latency (E2E)** | Benchmark | Total request latency | ↑ 2-3% (staging overhead) |
| **Tokens Staged** | vLLM metrics | Draft tokens staged | Should equal draft tokens |
| **Tokens Committed** | vLLM metrics | Staged tokens written | Should equal accepted tokens |
| **Writes Saved %** | vLLM metrics | Share of rejected (staged − committed) tokens whose KV writes were skipped | Should be ~100% |

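The staged/committed accounting in the table can be sanity-checked in a few lines of Python. All numbers below (counter values and the per-token KV write size) are illustrative assumptions, not vLLM output:

```python
# Hypothetical counters from one profiling window (not real vLLM output).
tokens_staged = 1000       # draft tokens staged by NWOR
tokens_committed = 880     # staged tokens accepted and written to the KV cache

rejected = tokens_staged - tokens_committed
rejection_rate = rejected / tokens_staged   # expected DRAM write reduction tracks this

# Illustrative per-token KV write size: 2 tensors (K and V) x 32 layers
# x 4096 hidden dim x 2 bytes (fp16). Real sizes depend on the model.
kv_write_bytes_per_token = 2 * 32 * 4096 * 2

bytes_saved = rejected * kv_write_bytes_per_token
print(f"rejection rate: {rejection_rate:.1%}")   # → 12.0%
print(f"KV write bytes saved: {bytes_saved / 1e6:.1f} MB")
```

If the measured `dram__bytes_write.sum` reduction is far below the rejection rate, NWOR is likely staging but not skipping writes.
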
### When NWOR Shows Benefits

✅ **Large batches** (32-128 requests) → more rejected writes
✅ **High memory pressure** → bandwidth bottleneck visible
✅ **Long sequences** → larger KV cache footprint
✅ **Multi-GPU** → inter-GPU bandwidth constrained
✅ **Sustained workloads** → cumulative bandwidth savings

❌ **Small batches** (8 requests) → low memory pressure, overhead dominates
❌ **Short runs** → overhead visible, benefits don't accumulate

### How to Profile NWOR

```bash
# 1. Run NCU bandwidth test
./run_ncu_bandwidth_test.sh

# 2. Check key metrics
python3 << EOF
import json

with open('sweeps/ncu_analysis/small_baseline_t0.7.json') as f:
    baseline = json.load(f)
with open('sweeps/ncu_analysis/small_nwor_t0.7.json') as f:
    nwor = json.load(f)

base_writes = baseline['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']
nwor_writes = nwor['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']

reduction_pct = ((base_writes - nwor_writes) / base_writes) * 100
print(f"DRAM Write Reduction: {reduction_pct:.2f}%")
print(f"Baseline: {base_writes/1e9:.4f} GB")
print(f"NWOR:     {nwor_writes/1e9:.4f} GB")
print(f"Saved:    {(base_writes - nwor_writes)/1e9:.4f} GB")
EOF
```

### Expected NCU Output

```
Baseline (NWOR off):
  DRAM Writes:  1,250,000,000 bytes (1.25 GB)
  DRAM Reads:   5,000,000,000 bytes (5.00 GB)
  L2 Writes:    45,200,000 sectors
  BW Util:      12.50%

NWOR Stage:
  DRAM Writes:  1,125,000,000 bytes (1.13 GB)  ← 10% reduction
  DRAM Reads:   5,000,000,000 bytes (5.00 GB)  ← Same
  L2 Writes:    40,700,000 sectors             ← 10% reduction
  BW Util:      11.80%                         ← Lower

Delta: -125 MB (-10%) in DRAM writes
```

---

## SCV (Speculative Comparison Vectorized) Graph Mode

### What SCV Optimizes
**Problem**: Mask computation for speculative verification uses a host-side Python loop (slow, sequential).

**Solution**: A vectorized GPU kernel plus CUDA graph capture (fast, parallel, near-zero dispatch cost).

### What SCV Does NOT Optimize
- ❌ DRAM bandwidth (same memory operations)
- ❌ KV cache writes (NWOR's job)
- ❌ Model computation (same forward passes)

### What SCV DOES Optimize
- ✅ **Host CPU overhead** (Python loop → GPU kernel)
- ✅ **Kernel launch overhead** (N launches → 1 launch, or 0 with graph replay)
- ✅ **CPU-GPU sync points** (per-iteration syncs → single sync)
- ✅ **Parallelism** (sequential requests → parallel)
- ✅ **Dispatch overhead** (kernel launch ~5µs → graph replay <1µs)

### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **Host CPU time** | Nsight Systems | Python loop overhead | ↓ 10-100µs (baseline has the loop) |
| **Kernel launch count** | Nsight Systems / NCU | Number of CUDA kernel launches | N launches → 1 (or 0 with graph) |
| **CUDA API overhead** | Nsight Systems | cudaLaunchKernel time | ↓ ~90% with graph capture |
| **GPU kernel time** | Nsight Systems / NCU | Actual computation time | Similar (same work, better parallelism) |
| **NVTX range** | Nsight Systems | "scv_compute_mask" marker | Visible in timeline |
| **Latency (E2E)** | Benchmark | Total request latency | ↓ 0-5µs or neutral |
| **`gpu__time_duration.sum`** | NCU | Total GPU time in kernel | Similar baseline vs SCV |
| **`sm__warps_launched.sum`** | NCU | Parallelism (warps) | Higher with SCV (parallel) |

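To make the vectorization concrete, here is a toy, pure-Python model of the acceptance-mask semantics (prefix acceptance). This is illustrative only, not vLLM's code: the real SCV kernel computes the same prefix-AND for all requests in parallel on the GPU, which is what eliminates the per-request loop and its launches.

```python
from itertools import accumulate

# Toy stand-in for the SCV mask computation: a draft token is accepted only
# if it matches the target token AND every earlier draft token in the same
# request also matched (prefix acceptance).
draft  = [[5, 9, 2], [7, 7, 7], [1, 2, 3]]   # [request][spec position]
target = [[5, 9, 4], [7, 0, 7], [1, 2, 3]]

def acceptance_mask(draft_row, target_row):
    matches = [d == t for d, t in zip(draft_row, target_row)]
    # Prefix-AND: once a token mismatches, everything after it is rejected.
    return list(accumulate(matches, lambda a, b: a and b))

masks = [acceptance_mask(d, t) for d, t in zip(draft, target)]
accepted = [sum(m) for m in masks]
print(accepted)   # → [2, 1, 3]
```

The baseline's cost is not this comparison (which is tiny) but running it once per request from Python, with a kernel launch each time; the vectorized version does one batched prefix-AND.
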
### How to Profile SCV

```bash
# 1. Run Nsight Systems analysis
./run_scv_benefit_analysis.sh

# 2. Open reports in the GUI
nsight-sys sweeps/scv_benefit_analysis/baseline_off_small_nsys.nsys-rep
nsight-sys sweeps/scv_benefit_analysis/scv_graph_small_nsys.nsys-rep

# 3. Compare timelines:
#    - CPU timeline: Python function calls (baseline) vs kernel launch (SCV)
#    - GPU timeline: count kernel launches
#    - CUDA API: count cudaLaunchKernel calls
#    - NVTX: find "scv_compute_mask" markers
```

### Expected Nsight Systems Output

**Baseline (SCV off)**:
```
CPU Timeline:
  ├─ Python: _compute_acceptance_mask (50µs)
  │   └─ for loop over requests
  │       ├─ cudaLaunchKernel (5µs) ← Multiple launches
  │       ├─ cudaLaunchKernel (5µs)
  │       └─ cudaLaunchKernel (5µs)
  └─ cudaDeviceSynchronize (10µs)

GPU Timeline:
  ├─ Kernel: compare_tokens (2µs)
  ├─ Kernel: compare_tokens (2µs)
  └─ Kernel: compare_tokens (2µs)

Total: ~80µs (50µs host + 30µs GPU/sync)
```

**SCV Graph Mode**:
```
CPU Timeline:
  ├─ Python: _scv_vectorized_mask (5µs) ← Single call
  │   └─ cudaGraphLaunch (<1µs) ← Graph replay!
  └─ cudaDeviceSynchronize (10µs)

GPU Timeline:
  └─ Kernel: _scv_compute_mask_inplace (6µs) ← Single kernel

NVTX:
  └─ [scv_compute_mask] (20µs total)

Total: ~20µs (5µs host + 6µs kernel + 10µs sync)
```

**Savings**: 80µs → 20µs = **60µs reduction (~75%)**

### SCV Graph Capture Benefit

**Without graph** (SCV vectorized mode):
- Kernel launch overhead: ~5µs per call
- Host dispatch: ~2µs
- Total overhead: ~7µs

**With graph** (SCV graph mode):
- Graph replay: <1µs
- Host dispatch: ~0.5µs
- Total overhead: ~1.5µs

**Graph benefit**: ~5.5µs saved per mask computation

At 100 iterations:
- Without graph: 7µs × 100 = 700µs
- With graph: 1.5µs × 100 = 150µs
- **Savings: 550µs (0.55ms)**

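The amortization above can be reproduced with a quick calculation (the per-call overhead figures are the estimates from this section, not measurements):

```python
# Dispatch-overhead amortization using the per-call estimates above.
LAUNCH_OVERHEAD_US = 7.0   # vectorized mode: kernel launch + host dispatch
GRAPH_OVERHEAD_US = 1.5    # graph mode: graph replay + host dispatch

def saved_us(iterations):
    """Total dispatch overhead saved by graph mode over `iterations` calls."""
    return (LAUNCH_OVERHEAD_US - GRAPH_OVERHEAD_US) * iterations

for n in (100, 1_000, 10_000):
    print(f"{n:>6} iterations: {saved_us(n) / 1000:.2f} ms saved")
# 100 iterations → 0.55 ms, matching the estimate above.
```

The savings grow linearly with decode steps, which is why graph mode matters most for long, sustained generation.
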
---

## Combined Analysis

### Trade-offs Summary

| Mode | Latency Impact | Bandwidth Impact | When to Use |
|------|----------------|------------------|-------------|
| **NWOR off, SCV off** | Baseline | Baseline | Baseline comparisons only |
| **NWOR stage, SCV off** | +2-3% | -10-15% writes | High memory pressure |
| **NWOR off, SCV graph** | -0.5% or neutral | None | Always (no downside) |
| **NWOR stage, SCV graph** | +2-3% | -10-15% writes | High memory pressure |

### Recommendations

1. **SCV Graph Mode**: ✅ **Always enable**
   - Negligible overhead (<2%)
   - Some scenarios show improvement
   - No downside, pure benefit

2. **NWOR Stage Mode**: ⚠️ **Enable for high-throughput workloads**
   - Costs 2-3% latency
   - Saves 10-15% of DRAM writes
   - Net positive under memory pressure (large batches, multi-GPU)
   - Make it configurable and document the trade-off

3. **Combined Mode**: ⚠️ **Use-case dependent**
   - SCV overhead is negligible; NWOR overhead dominates
   - Best for sustained high-throughput workloads
   - Profile your specific workload first

---

## Quick Reference Commands

### Measure NWOR Bandwidth Savings
```bash
./run_ncu_bandwidth_test.sh
# Check: sweeps/ncu_analysis/*_stats.txt
# Look for: dram__bytes_write.sum reduction
```

### Measure SCV Host Overhead Reduction
```bash
./run_scv_benefit_analysis.sh
# Open: nsight-sys sweeps/scv_benefit_analysis/*_nsys.nsys-rep
# Compare: CPU timeline, kernel launch counts
```

### Quick Latency-Only Test
```bash
./run_benchmark_sweep.sh
# Check: sweeps/*.json for latency_avg_s
```

---

## Interpretation

### NWOR is Working If:
- ✅ `nwor_writes_saved_pct` is ~100%
- ✅ `dram__bytes_write.sum` is reduced by ~10-15%
- ✅ `lts__t_sectors_op_write.sum` is reduced proportionally
- ⚠️ Latency increased by 2-3% (expected overhead)

### SCV is Working If:
- ✅ Latency is neutral or slightly improved
- ✅ Nsight Systems shows fewer kernel launches
- ✅ Nsight Systems shows reduced host CPU time
- ✅ NVTX markers are visible for "scv_compute_mask"
- ✅ Graph replay takes <1µs (vs ~5µs per kernel launch)

### Both are Working If:
- ✅ NWOR metrics look correct (above)
- ✅ SCV metrics look correct (above)
- ⚠️ Combined overhead ≈ NWOR overhead (SCV adds minimal)