- Latency Hiding
  - Thread and warp scheduling
  - Launch configuration
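
  To accompany the latency-hiding and launch-configuration items above, here is a minimal sketch of sizing a launch so enough threads are in flight to hide memory latency. It uses the standard `cudaOccupancyMaxPotentialBlockSize` helper; the kernel, problem size, and variable names are illustrative assumptions, not material from the course.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  // Simple element-wise kernel: many independent threads in flight
  // let the warp scheduler hide global-memory latency.
  __global__ void saxpy(int n, float a, const float *x, float *y)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
  }

  int main()
  {
      const int n = 1 << 24;                 // hypothetical problem size
      float *x, *y;
      cudaMalloc(&x, n * sizeof(float));
      cudaMalloc(&y, n * sizeof(float));
      cudaMemset(x, 0, n * sizeof(float));
      cudaMemset(y, 0, n * sizeof(float));

      // Let the runtime suggest a block size with good occupancy,
      // then launch enough blocks to cover the whole problem.
      int minGridSize = 0, blockSize = 0;
      cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
      int gridSize = (n + blockSize - 1) / blockSize;

      saxpy<<<gridSize, blockSize>>>(n, 2.0f, x, y);
      cudaDeviceSynchronize();
      printf("launched %d blocks of %d threads\n", gridSize, blockSize);

      cudaFree(x);
      cudaFree(y);
      return 0;
  }
  ```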
- Memory Hierarchy and Access Patterns
  - Local storage, shared memory, and global memory
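
  As a companion to the memory-hierarchy item above, the sketch below stages a tile of global memory in shared memory so that global loads and stores stay coalesced while the irregular (reversed) indexing happens in fast on-chip memory. The block-reverse kernel and sizes are assumptions chosen for brevity; the array length is assumed to be a multiple of the block size.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  #define BLOCK 256

  // Each block loads a tile into shared memory (coalesced), then writes
  // it back reversed; the reversed reads hit shared memory, not DRAM.
  __global__ void block_reverse(int *d, int n)
  {
      __shared__ int tile[BLOCK];
      int gid = blockIdx.x * blockDim.x + threadIdx.x;
      if (gid < n) tile[threadIdx.x] = d[gid];                   // coalesced load
      __syncthreads();                                           // tile fully populated
      if (gid < n) d[gid] = tile[blockDim.x - 1 - threadIdx.x];  // reversed read from shared
  }

  int main()
  {
      const int n = 1 << 20;                 // assumed to be a multiple of BLOCK
      int *h = new int[n];
      for (int i = 0; i < n; ++i) h[i] = i;

      int *d;
      cudaMalloc(&d, n * sizeof(int));
      cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
      block_reverse<<<n / BLOCK, BLOCK>>>(d, n);
      cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

      printf("h[0] = %d (expected %d)\n", h[0], BLOCK - 1);
      cudaFree(d);
      delete[] h;
      return 0;
  }
  ```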
- Understanding Bottlenecks
  - Memory-bound vs compute-bound codes
  - Extracting bandwidth
  - Memory alignment
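
  To illustrate the "extracting bandwidth" item above, here is a minimal sketch that times a pure copy kernel with CUDA events and reports the achieved rate; a copy is memory-bound by construction, so the result approximates usable global-memory bandwidth. The grid size, block size, and buffer size are arbitrary assumptions.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  // Pure copy: one read + one write per element, so the achieved rate
  // is a direct estimate of sustainable global-memory bandwidth.
  __global__ void copy(const float *in, float *out, size_t n)
  {
      for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += (size_t)gridDim.x * blockDim.x)        // grid-stride loop
          out[i] = in[i];
  }

  int main()
  {
      const size_t n = 1 << 26;                        // ~256 MiB per buffer
      float *in, *out;
      cudaMalloc(&in,  n * sizeof(float));
      cudaMalloc(&out, n * sizeof(float));
      cudaMemset(in, 0, n * sizeof(float));

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start);
      copy<<<1024, 256>>>(in, out, n);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);

      float ms = 0.f;
      cudaEventElapsedTime(&ms, start, stop);
      double gb = 2.0 * n * sizeof(float) / 1e9;       // bytes read + bytes written
      printf("achieved bandwidth: %.1f GB/s\n", gb / (ms / 1e3));

      cudaFree(in);
      cudaFree(out);
      return 0;
  }
  ```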
- Assignment
- Atomics and Reductions
  - Atomic operations
  - Classical parallel reduction
  - Parallel reduction + atomics
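
  The sketch below combines the items above: a classical shared-memory tree reduction within each block, with each block's partial sum folded into the final result by a single `atomicAdd`. Grid and block sizes, buffer names, and the all-ones input are illustrative assumptions.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  #define BLOCK 256

  // Classical in-block tree reduction; one atomicAdd per block (not per
  // element) folds the partial sums into the global result.
  __global__ void reduce_atomic(const float *in, float *out, size_t n)
  {
      __shared__ float sdata[BLOCK];
      size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
      float v = 0.f;
      // grid-stride accumulation so any n works with a fixed grid
      for (size_t i = gid; i < n; i += (size_t)gridDim.x * blockDim.x)
          v += in[i];
      sdata[threadIdx.x] = v;
      __syncthreads();

      // tree reduction within the block
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s)
              sdata[threadIdx.x] += sdata[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0)
          atomicAdd(out, sdata[0]);
  }

  int main()
  {
      const size_t n = 1 << 20;
      float *in, *out;
      cudaMalloc(&in,  n * sizeof(float));
      cudaMalloc(&out, sizeof(float));

      // fill the input with 1.0f so the expected sum is n
      float *h = new float[n];
      for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
      cudaMemcpy(in, h, n * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemset(out, 0, sizeof(float));

      reduce_atomic<<<256, BLOCK>>>(in, out, n);

      float result = 0.f;
      cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
      printf("sum = %.0f (expected %zu)\n", result, n);

      cudaFree(in); cudaFree(out); delete[] h;
      return 0;
  }
  ```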
- Warp Shuffle Techniques
  - Warp-level reduction
  - Reduction with warp shuffle
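
  To accompany the warp-shuffle items above, here is a minimal sketch of a warp-level sum using `__shfl_down_sync`, with lane 0 of each warp contributing its partial sum via an atomic. The launch configuration and data are assumptions; the problem size is assumed to be a multiple of the block size.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  // Warp-level sum: each step folds the upper half of the warp onto the
  // lower half via register-to-register shuffles, no shared memory needed.
  __inline__ __device__ float warp_reduce_sum(float v)
  {
      for (int offset = 16; offset > 0; offset >>= 1)
          v += __shfl_down_sync(0xffffffff, v, offset);
      return v;   // lane 0 ends up holding the warp's sum
  }

  __global__ void reduce_shuffle(const float *in, float *out, size_t n)
  {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      float v = (i < n) ? in[i] : 0.f;

      v = warp_reduce_sum(v);
      if ((threadIdx.x & 31) == 0)        // lane 0 of each warp
          atomicAdd(out, v);
  }

  int main()
  {
      const size_t n = 1 << 20;
      float *in, *out;
      cudaMalloc(&in,  n * sizeof(float));
      cudaMalloc(&out, sizeof(float));

      float *h = new float[n];
      for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
      cudaMemcpy(in, h, n * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemset(out, 0, sizeof(float));

      const int block = 256;
      reduce_shuffle<<<(n + block - 1) / block, block>>>(in, out, n);

      float result = 0.f;
      cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
      printf("sum = %.0f (expected %zu)\n", result, n);

      cudaFree(in); cudaFree(out); delete[] h;
      return 0;
  }
  ```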
- Assignment
- Pinned Memory
  - Definition and benefits
  - Usage and functions
  - Implications for host memory
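
  As a companion to the pinned-memory items above, here is a minimal sketch using `cudaMallocHost`/`cudaFreeHost` for a page-locked host buffer and timing a host-to-device copy from it. The buffer size and variable names are assumptions; actual throughput depends on the system.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      const size_t n = 1 << 24;
      const size_t bytes = n * sizeof(float);

      // Pinned (page-locked) host allocation: the GPU can DMA directly
      // from it, enabling faster copies and true asynchronous transfers.
      float *h_pinned = nullptr;
      cudaMallocHost(&h_pinned, bytes);          // replaces malloc for this buffer
      for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

      float *d = nullptr;
      cudaMalloc(&d, bytes);

      cudaEvent_t start, stop;
      cudaEventCreate(&start); cudaEventCreate(&stop);
      cudaEventRecord(start);
      cudaMemcpy(d, h_pinned, bytes, cudaMemcpyHostToDevice);
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);

      float ms = 0.f;
      cudaEventElapsedTime(&ms, start, stop);
      printf("H2D from pinned memory: %.2f GB/s\n", bytes / 1e9 / (ms / 1e3));

      cudaFree(d);
      cudaFreeHost(h_pinned);                    // pinned memory has its own free
      return 0;
  }
  ```

  Note that pinned pages cannot be swapped out by the operating system, so very large pinned allocations reduce the memory available to the rest of the host.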
- CUDA Streams
  - Overview, purpose, and semantics
  - Creation, usage, and copy-compute overlap
  - Stream behavior examples and default stream considerations
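
  The sketch below illustrates the copy-compute overlap pattern from the streams items above: the work is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued into its own stream so transfers for one chunk can overlap compute for another. The chunk count, sizes, and the `scale` kernel are assumptions; pinned host memory is required for the async copies to overlap.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale(float *d, int n, float a)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= a;
  }

  int main()
  {
      const int nStreams = 4;
      const int chunk = 1 << 20;                   // elements per chunk
      const int n = nStreams * chunk;
      const size_t bytes = (size_t)n * sizeof(float);

      float *h, *d;
      cudaMallocHost(&h, bytes);                   // pinned: needed for true async overlap
      cudaMalloc(&d, bytes);
      for (int i = 0; i < n; ++i) h[i] = 1.0f;

      cudaStream_t streams[nStreams];
      for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

      // Pipeline: copy-in, kernel, copy-out per chunk, each in its own
      // stream, so one chunk's copies overlap another chunk's compute.
      for (int s = 0; s < nStreams; ++s) {
          int off = s * chunk;
          cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                          cudaMemcpyHostToDevice, streams[s]);
          scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
          cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                          cudaMemcpyDeviceToHost, streams[s]);
      }
      cudaDeviceSynchronize();

      printf("h[0] = %.1f (expected 2.0)\n", h[0]);

      for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
      cudaFree(d);
      cudaFreeHost(h);
      return 0;
  }
  ```

  Work issued without an explicit stream goes to the default stream, which (unless per-thread default streams are enabled) synchronizes with the other streams and can serialize the pipeline.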
- Multi-GPU Concurrency
  - Device management
  - Streams across multiple GPUs
  - Device-to-device data copying
  - Peer-to-peer transfers
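
  To accompany the multi-GPU items above, here is a minimal sketch of device management with `cudaSetDevice`, enabling peer access where available, and a device-to-device copy with `cudaMemcpyPeer`. Buffer sizes and names are assumptions; the example simply exits if fewer than two GPUs are present.

  ```cuda
  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev < 2) { printf("needs at least two GPUs\n"); return 0; }

      const size_t bytes = (size_t)(1 << 24) * sizeof(float);
      float *buf0, *buf1;

      cudaSetDevice(0);                 // subsequent calls target device 0
      cudaMalloc(&buf0, bytes);
      cudaSetDevice(1);                 // switch the current device
      cudaMalloc(&buf1, bytes);

      // Enable direct peer access where the hardware supports it, so
      // transfers can go GPU-to-GPU without staging through the host.
      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, 1, 0);
      if (canAccess) {
          cudaSetDevice(1);
          cudaDeviceEnablePeerAccess(0, 0);
      }

      // Device-to-device copy from device 0's buffer to device 1's buffer.
      cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
      cudaDeviceSynchronize();
      printf("copied %zu bytes from device 0 to device 1 (peer access %s)\n",
             bytes, canAccess ? "enabled" : "not available");

      cudaSetDevice(0); cudaFree(buf0);
      cudaSetDevice(1); cudaFree(buf1);
      return 0;
  }
  ```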
- Assignment
An alternate perspective on all delivered materials, using the GTC presentation "CUDA Performance Optimization" as a baseline.
To properly open ".nsys-rep" files, please download the Nsight Systems tool [link].
Similarly, use the Nsight Compute tool to open ".ncu-rep" files [link].