Merged
60 commits
f3c91b3
feat(gpu_model_runner): add SCV graph capture availability check and …
Oct 15, 2025
bd53fb9
refactor(scv): enable CUDA graph in eager and capture full round
Oct 15, 2025
249c701
Optimize NWOR commit path
yuz207 Oct 16, 2025
27af3f3
Reduce NWOR mask construction overhead
yuz207 Oct 16, 2025
841323c
Skip deferred writer tests when GPUModelRunner unavailable
yuz207 Oct 16, 2025
853be5b
Add NWOR microbench harness
yuz207 Oct 16, 2025
4045d85
Add configurable NWOR microbenchmark harness
yuz207 Oct 16, 2025
7e7ccff
Allow configuring NWOR and SCV modes in microbench
yuz207 Oct 16, 2025
e61013e
Fix NWOR request offset init
yuz207 Oct 16, 2025
9751421
Capture summary statistics and profiler hooks in NWOR harness
yuz207 Oct 16, 2025
fbe2de9
Parse Nsight Compute metrics into summary
yuz207 Oct 16, 2025
075af02
Use new speculative config API and metric snapshots in NWOR harness
yuz207 Oct 16, 2025
a1f9cc7
Add max_model_len support to NWOR harness
yuz207 Oct 16, 2025
bc92d7b
Switch NWOR microbench harness to synchronous LLM API
yuz207 Oct 16, 2025
8f23588
Guard SCV mask against out-of-bounds sampled token indices
yuz207 Oct 17, 2025
e59fa35
Add host-side SCV validation and improve error handling
yuz207 Oct 17, 2025
f22912f
Add comprehensive SCV OOB and edge case tests
yuz207 Oct 17, 2025
dd91043
Add SCV baseline measurements (all modes stable)
yuz207 Oct 17, 2025
570ab98
Document SCV Phase 0 completion and findings
yuz207 Oct 17, 2025
b98aceb
Add conditional NWOR debug logging
yuz207 Oct 17, 2025
833ce76
Fix NameError: add missing os import for NWOR debug flag
yuz207 Oct 17, 2025
40cc16b
Document NWOR/SCV full validation - all systems working
yuz207 Oct 17, 2025
b0f9959
Harden NWOR acceptance fallback and debug flag parsing
yuz207 Oct 17, 2025
87de936
Guard NWOR debug checks in fallback
yuz207 Oct 17, 2025
3c2df95
Collect NWOR metrics from engine in microbench
yuz207 Oct 17, 2025
85a974f
Normalize Nsight counter names in microbench summary
yuz207 Oct 17, 2025
0cd6f2f
Fallback to *_total counters in NWOR summary
yuz207 Oct 17, 2025
9cb71d9
Add SCV graph safety guards and fallback
yuz207 Oct 17, 2025
f7acd3c
docs(worker): add comment for profiling step in SCVGraphExecutor
yuz207 Oct 17, 2025
17ce097
Add NVTX profiling hooks for SCV mask
yuz207 Oct 17, 2025
907670e
Load torch.cuda.nvtx lazily for SCV profiling
yuz207 Oct 17, 2025
944d6cc
Instrument SCV NVTX range
yuz207 Oct 17, 2025
f0aeaf6
Add SCV graph capture with safety fixes
yuz207 Oct 17, 2025
0607efc
Fix test failures: add _make_mock_runner helper
yuz207 Oct 17, 2025
1e6f214
Fix SCV graph capture bugs and improve platform config
yuz207 Oct 18, 2025
e16d522
Disable SCV graph capture when full CUDA graphs are active
yuz207 Oct 18, 2025
6c96425
Remove incorrect SCV nested capture detection and redundant checks
yuz207 Oct 18, 2025
e4cb202
Add comprehensive profiling infrastructure for NWOR and SCV analysis
yuz207 Oct 19, 2025
d0ac344
Switch NWOR to immediate mode when full CUDA graphs are active
yuz207 Oct 19, 2025
44866a3
Guard NWOR staging from unexpected graph capture
yuz207 Oct 19, 2025
662e918
Add profiling and analysis scripts
yuz207 Oct 19, 2025
19f8bb7
Fix SCV graph cache not populating after successful capture
yuz207 Oct 19, 2025
639ab28
Optimize NWOR/SCV hot paths to reduce GPU-CPU sync overhead
yuz207 Oct 19, 2025
d87ba0d
Fix code quality issues: replace assertion with exception and add def…
yuz207 Oct 19, 2025
24e3cf9
Final code quality and correctness improvements for PR
yuz207 Oct 19, 2025
3f3054a
Fix NWOR staging across layers
yuz207 Oct 20, 2025
2cf565b
Fix NWOR staging across layers
yuz207 Oct 20, 2025
e67d4cf
Merge pull request #7 from IluvatarLabs/terragon/optimize-nwor-commit…
yuz207 Oct 20, 2025
f2118ec
Improve NWOR commit bookkeeping
yuz207 Oct 20, 2025
9a4a8bc
Avoid redundant SCV acceptance sync
yuz207 Oct 20, 2025
b9dca0d
Correct NWOR fallback and commit metrics
yuz207 Oct 20, 2025
e36c133
Simplify NWOR commit segment handling
yuz207 Oct 20, 2025
06215e1
Expand NWOR manager tests
yuz207 Oct 20, 2025
afa1da8
Add mask-based NWOR commit fast path
yuz207 Oct 20, 2025
7c8dffb
Optimize mask commit path
yuz207 Oct 20, 2025
3fa5997
Optimize NWOR deferred write path
yuz207 Oct 20, 2025
d6d4943
Cache shared slot mapping and drop redundant checks
yuz207 Oct 20, 2025
b629b98
Optimize contiguous mask commit path
yuz207 Oct 20, 2025
595d52c
Restore restricted context guard in stage_layer
yuz207 Oct 20, 2025
dcdc56b
Remove redundant CUDA context check in stage_layer
yuz207 Oct 20, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -218,3 +218,4 @@ csrc/moe/marlin_moe_wna16/kernel_*

# Ignore ep_kernels_workspace folder
ep_kernels_workspace/
sweeps/
276 changes: 276 additions & 0 deletions PROFILING_GUIDE.md
@@ -0,0 +1,276 @@
# NWOR + SCV Profiling Guide

## Overview

This guide explains what NWOR and SCV optimize, what metrics to measure, and which tools to use.

---

## NWOR (Non-blocking Write-Or-Read) Stage Mode

### What NWOR Optimizes
**Problem**: Speculative decoding writes draft tokens to the KV cache, then overwrites them when they are rejected (wasted DRAM write bandwidth).

**Solution**: Stage draft tokens in temporary buffers, only write accepted tokens to KV cache.

### What NWOR Does NOT Optimize
- ❌ Latency (adds 2-3% overhead from staging logic)
- ❌ Computation (same model forward passes)
- ❌ CPU time (minimal impact)

### What NWOR DOES Optimize
- ✅ **DRAM write bandwidth** (primary benefit)
- ✅ **Memory write pressure** (reduces cache contention)
- ✅ **KV cache write traffic** (only accepted tokens)

### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **`dram__bytes_write.sum`** | NCU | Total DRAM writes | ↓ 10-15% (matches rejection rate) |
| **`dram__bytes_read.sum`** | NCU | Total DRAM reads | No change (same reads) |
| **`lts__t_sectors_op_write.sum`** | NCU | L2 cache write traffic | ↓ 10-15% (tracks DRAM writes) |
| **`dram__throughput.avg.pct_of_peak`** | NCU | Memory bandwidth utilization | ↓ if memory-bound |
| **Latency (E2E)** | Benchmark | Total request latency | ↑ 2-3% (staging overhead) |
| **Tokens Staged** | vLLM metrics | Draft tokens staged | Should equal draft tokens |
| **Tokens Committed** | vLLM metrics | Staged tokens written | Should equal accepted tokens |
| **Writes Saved %** | vLLM metrics | (staged - committed) / staged | Should be ~100% |

### When NWOR Shows Benefits

✅ **Large batches** (32-128 requests) → more rejected writes
✅ **High memory pressure** → bandwidth bottleneck visible
✅ **Long sequences** → larger KV cache footprint
✅ **Multi-GPU** → inter-GPU bandwidth constrained
✅ **Sustained workload** → cumulative bandwidth savings

❌ **Small batches** (8 requests) → low memory pressure, overhead dominates
❌ **Short runs** → overhead visible, benefits don't accumulate

### How to Profile NWOR

```bash
# 1. Run NCU bandwidth test
./run_ncu_bandwidth_test.sh

# 2. Check key metrics
python3 << EOF
import json
with open('sweeps/ncu_analysis/small_baseline_t0.7.json') as f:
    baseline = json.load(f)
with open('sweeps/ncu_analysis/small_nwor_t0.7.json') as f:
    nwor = json.load(f)

base_writes = baseline['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']
nwor_writes = nwor['summary']['per_mode'][0]['ncu_metrics']['dram__bytes_write.sum']

reduction_pct = ((base_writes - nwor_writes) / base_writes) * 100
print(f"DRAM Write Reduction: {reduction_pct:.2f}%")
print(f"Baseline: {base_writes/1e9:.4f} GB")
print(f"NWOR: {nwor_writes/1e9:.4f} GB")
print(f"Saved: {(base_writes - nwor_writes)/1e9:.4f} GB")
EOF
```

### Expected NCU Output

```
Baseline (NWOR off):
DRAM Writes: 1,250,000,000 bytes (1.25 GB)
DRAM Reads: 5,000,000,000 bytes (5.00 GB)
L2 Writes: 45,200,000 sectors
BW Util: 12.50%

NWOR Stage:
DRAM Writes: 1,125,000,000 bytes (1.13 GB) ← 10% reduction!
DRAM Reads: 5,000,000,000 bytes (5.00 GB) ← Same
L2 Writes: 40,700,000 sectors ← 10% reduction
BW Util: 11.80% ← Lower

Delta: -125 MB (-10%) in DRAM writes
```

---

## SCV (Speculative Comparison Vectorized) Graph Mode

### What SCV Optimizes
**Problem**: Mask computation for speculative verification uses a host-side Python loop (slow and sequential).

**Solution**: Vectorized GPU kernel + CUDA graph capture (fast, parallel, near-zero dispatch).

### What SCV Does NOT Optimize
- ❌ DRAM bandwidth (same memory operations)
- ❌ KV cache writes (NWOR's job)
- ❌ Model computation (same forward passes)

### What SCV DOES Optimize
- ✅ **Host CPU overhead** (Python loop → GPU kernel)
- ✅ **Kernel launch overhead** (N launches → 1 launch, or graph = 0)
- ✅ **CPU-GPU sync points** (loop syncs → single sync)
- ✅ **Parallelism** (sequential requests → parallel)
- ✅ **Dispatch overhead** (kernel launch ~5µs → graph replay <1µs)
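
The mask itself follows a prefix-acceptance rule: a draft token is accepted only if every earlier draft token in the same request also matched. That running AND is exactly a cumulative product of per-position equality, which is what makes the computation vectorizable. A sketch using plain Python lists for clarity (the real kernel applies the same recurrence to GPU tensors, with names that differ from these):

```python
from itertools import accumulate

def acceptance_mask(draft, target):
    """Prefix-acceptance mask, one row of token ids per request.

    Position j is accepted iff draft[:j+1] == target[:j+1] for that
    request, i.e. a running AND over per-position equality.
    """
    return [
        list(accumulate((d == t for d, t in zip(d_row, t_row)),
                        lambda acc, eq: acc and eq))
        for d_row, t_row in zip(draft, target)
    ]
```

For example, `acceptance_mask([[7, 8, 9]], [[7, 8, 1]])` yields `[[True, True, False]]`: the first mismatch rejects that position and everything after it.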

### Metrics to Measure

| Metric | Tool | Purpose | Expected Result |
|--------|------|---------|-----------------|
| **Host CPU time** | Nsight Systems | Python loop overhead | ↓ 10-100µs (baseline has loop) |
| **Kernel launch count** | Nsight Systems / NCU | Number of CUDA kernel launches | N launches → 1 (or 0 with graph) |
| **CUDA API overhead** | Nsight Systems | cudaLaunchKernel time | ↓ 90% with graph capture |
| **GPU kernel time** | Nsight Systems / NCU | Actual computation time | Similar (same work, better parallelism) |
| **NVTX range** | Nsight Systems | "scv_compute_mask" marker | Visible in timeline |
| **Latency (E2E)** | Benchmark | Total request latency | ↓ 0-5µs or neutral |
| **`gpu__time_duration.sum`** | NCU | Total GPU time in kernel | Similar baseline vs SCV |
| **`sm__warps_launched.sum`** | NCU | Parallelism (warps) | Higher with SCV (parallel) |

### How to Profile SCV

```bash
# 1. Run Nsight Systems analysis
./run_scv_benefit_analysis.sh

# 2. Open reports in GUI
nsight-sys sweeps/scv_benefit_analysis/baseline_off_small_nsys.nsys-rep
nsight-sys sweeps/scv_benefit_analysis/scv_graph_small_nsys.nsys-rep

# 3. Compare timelines:
# - CPU timeline: Look for Python function calls (baseline) vs kernel launch (SCV)
# - GPU timeline: Count kernel launches
# - CUDA API: Count cudaLaunchKernel calls
# - NVTX: Find "scv_compute_mask" markers
```

### Expected Nsight Systems Output

**Baseline (SCV off)**:
```
CPU Timeline:
├─ Python: _compute_acceptance_mask (50µs)
│ └─ for loop over requests
│ ├─ cudaLaunchKernel (5µs) ← Multiple launches
│ ├─ cudaLaunchKernel (5µs)
│ └─ cudaLaunchKernel (5µs)
└─ cudaDeviceSynchronize (10µs)

GPU Timeline:
├─ Kernel: compare_tokens (2µs)
├─ Kernel: compare_tokens (2µs)
└─ Kernel: compare_tokens (2µs)

Total: ~80µs (50µs host + 30µs GPU/sync)
```

**SCV Graph Mode**:
```
CPU Timeline:
├─ Python: _scv_vectorized_mask (5µs) ← Single call
│ └─ cudaGraphLaunch (<1µs) ← Graph replay!
└─ cudaDeviceSynchronize (10µs)

GPU Timeline:
└─ Kernel: _scv_compute_mask_inplace (6µs) ← Single kernel

NVTX:
└─ [scv_compute_mask] (20µs total)

Total: ~20µs (5µs host + 6µs kernel + 10µs sync)
```

**Savings**: 80µs → 20µs = **60µs reduction (~75%)**

### SCV Graph Capture Benefit

**Without graph** (SCV vectorized mode):
- Kernel launch overhead: ~5µs per call
- Host dispatch: ~2µs
- Total overhead: ~7µs

**With graph** (SCV graph mode):
- Graph replay: <1µs
- Host dispatch: ~0.5µs
- Total overhead: ~1.5µs

**Graph benefit**: ~5.5µs saved per mask computation

At 100 iterations:
- Without graph: 7µs × 100 = 700µs
- With graph: 1.5µs × 100 = 150µs
- **Savings: 550µs (0.55ms)**
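
The break-even arithmetic is easy to reproduce. Note the per-call figures are the estimates from this section, not measured constants:

```python
# Estimated per-call overheads from the section above (microseconds).
VECTORIZED_US = 7.0   # kernel launch (~5us) + host dispatch (~2us)
GRAPH_US = 1.5        # graph replay (<1us) + host dispatch (~0.5us)

def overhead_saved_us(iterations):
    # Cumulative dispatch overhead avoided by replaying a captured
    # graph instead of launching the kernel each iteration.
    return iterations * (VECTORIZED_US - GRAPH_US)
```

At 100 iterations this reproduces the 550µs (0.55ms) figure above; the saving scales linearly with iteration count.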

---

## Combined Analysis

### Trade-offs Summary

| Mode | Latency Impact | Bandwidth Impact | When to Use |
|------|----------------|------------------|-------------|
| **NWOR off, SCV off** | Baseline | Baseline | Baseline comparison only |
| **NWOR stage, SCV off** | +2-3% | -10-15% writes | High memory pressure |
| **NWOR off, SCV graph** | -0.5% or neutral | None | Always (no downside) |
| **NWOR stage, SCV graph** | +2-3% | -10-15% writes | High memory pressure |

### Recommendations

1. **SCV Graph Mode**: ✅ **Always enable**
- Negligible overhead (<2%)
- Some scenarios show improvement
- No downside, pure benefit

2. **NWOR Stage Mode**: ⚠️ **Enable for high-throughput workloads**
- Costs 2-3% latency
- Saves 10-15% DRAM writes
- Net positive under memory pressure (large batches, multi-GPU)
- Make configurable, document trade-off

3. **Combined Mode**: ⚠️ **Use case dependent**
- SCV overhead negligible, NWOR overhead dominates
- Best for sustained high-throughput workloads
- Profile your specific workload first

---

## Quick Reference Commands

### Measure NWOR Bandwidth Savings
```bash
./run_ncu_bandwidth_test.sh
# Check: sweeps/ncu_analysis/*_stats.txt
# Look for: dram__bytes_write.sum reduction
```

### Measure SCV Host Overhead Reduction
```bash
./run_scv_benefit_analysis.sh
# Open: nsight-sys sweeps/scv_benefit_analysis/*_nsys.nsys-rep
# Compare: CPU timeline, kernel launch counts
```

### Quick Latency-Only Test
```bash
./run_benchmark_sweep.sh
# Check: sweeps/*.json for latency_avg_s
```

---

## Interpretation

### NWOR is Working If:
- ✅ `nwor_writes_saved_pct` = 100%
- ✅ `dram__bytes_write.sum` reduced by ~10-15%
- ✅ `lts__t_sectors_op_write.sum` reduced proportionally
- ⚠️ Latency increased by 2-3% (expected overhead)

### SCV is Working If:
- ✅ Latency neutral or slightly improved
- ✅ Nsight Systems shows fewer kernel launches
- ✅ Nsight Systems shows reduced host CPU time
- ✅ NVTX markers visible for "scv_compute_mask"
- ✅ Graph replay <1µs (vs ~5µs kernel launch)

### Both are Working If:
- ✅ NWOR metrics correct (above)
- ✅ SCV metrics correct (above)
- ⚠️ Combined overhead ~= NWOR overhead (SCV adds minimal)
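
These checks can be automated against the harness's JSON summary. A sketch, with the caveat that the metric key names here are assumptions mirroring the tables above, not the harness's exact schema:

```python
def nwor_looks_healthy(metrics):
    """Sanity-check NWOR from a metrics dict (hypothetical keys)."""
    staged = metrics["tokens_staged"]
    committed = metrics["tokens_committed"]
    write_drop = metrics["dram_write_reduction_pct"]
    return (
        staged >= committed                  # cannot commit more than staged
        and committed == metrics["tokens_accepted"]
        and 5.0 <= write_drop <= 20.0        # the ~10-15% band, with slack
    )
```

A run with 1000 staged tokens, 850 committed/accepted, and a 12% DRAM-write reduction passes; a run showing no write reduction fails and warrants a look at the NCU output.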