Profiling with Nsight Compute
These instructions describe how to run the Nvidia Nsight Compute profiling tool on AiMOS. Unless otherwise noted, all commands were executed on an AiMOS front end node. In the description below, *application* refers to the code being run under Nsight Compute to understand its performance.

Nsight Compute has a graphical interface, `ncu-ui`, which runs on your local system and displays performance information for the application being tested, and a command line interface, `ncu`, which runs on the compute nodes of the system where the application runs (i.e., AiMOS).
Reference material:

- Kokkos Lecture on Tools (skip to slide 27): https://raw.githubusercontent.com/kokkos/kokkos-tutorials/main/LectureSeries/KokkosTutorial_07_Tools.pdf
- Manual: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html
  - List of available metrics to collect: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sections-and-rules
  - Command line interface documentation: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
  - Hardware model for understanding what the metrics are measuring: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model
- Guided Analysis Video: https://www.youtube.com/watch?v=04dJ-aePYpE
  - Other videos are available here: https://developer.nvidia.com/nsight-compute-videos
- NERSC Tutorial on Perlmutter: https://www.nersc.gov/users/training/events/nsight-systems-and-nsight-compute-profiling-workshop-aug2022/
  - Note, sections of this tutorial provide Perlmutter-specific instructions that will not apply to AiMOS.
Set up the build environment on an AiMOS front end node:

```
module use /gpfs/u/software/dcs-rhel8-spack-install/v0162gccSpectrum/lmod/linux-rhel8-ppc64le/Core/
module load spectrum-mpi/10.4-2ycgnlq
module load cmake/3.20.0/1
module load netcdf-cxx4/4.3.1-gdysz4t
module load cuda/11.1-3vj4c72
export root=$PWD
export OMPI_CXX=$root/kokkos/bin/nvcc_wrapper
export OMPI_CC=gcc
```
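Since `OMPI_CXX` must point at the `nvcc_wrapper` script inside a Kokkos source tree, it is worth confirming the wrapper exists and is executable before building anything. A minimal sketch (the `check_wrapper` helper and the demo paths are illustrative, not part of the AiMOS setup):

```shell
# check_wrapper: report whether a compiler wrapper path is executable.
# In practice, pass "$root/kokkos/bin/nvcc_wrapper".
check_wrapper() {
  if [ -x "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

check_wrapper /bin/sh           # exists on essentially every system
check_wrapper /no/such/wrapper  # demonstrates the failure message
```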
The `nvprof-connector` is required to provide readable names for the kernels being profiled. Build it as follows:

```
cd $root
git clone -b master https://github.com/kokkos/kokkos-tools
cd kokkos-tools/profiling/nvprof-connector
make
```
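A successful `make` should leave `kp_nvprof_connector.so` in the connector directory; checking for it before pointing `KOKKOS_TOOLS_LIBS` at it avoids a silent no-op at run time. A small sketch (the `check_lib` helper and the stand-in file are illustrative):

```shell
# check_lib: verify a Kokkos Tools connector library file exists.
check_lib() {
  if [ -f "$1" ]; then
    echo "connector present: $1"
  else
    echo "connector missing: $1 (did make succeed?)"
  fi
}

# Demo with a stand-in file; in practice pass the path to
# kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so.
tmp=$(mktemp)
check_lib "$tmp"
check_lib /no/such/kp_nvprof_connector.so
rm -f "$tmp"
```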
Create a file named `submitAimos.sh` with the following contents.

```
sbatch -N 1 -n 1 -t 10 -p el8-rpi --gres=gpu:16g:4 ./runAimos.sh
```
Create a file named `runAimos.sh` with the following contents. Note, this script is set up to collect memory and compute performance information on the `pseudo push` kernel (the trailing forward slash in the NVTX filter is required) of the `ps_combo` test application defined in the pumipic `cws/dpsTesting` branch (as of 7a71080). The paths to `kp_nvprof_connector.so` and `ps_combo` must be changed before running.
```
#!/bin/bash
module use /gpfs/u/software/dcs-rhel8-spack-install/v0162gccSpectrum/lmod/linux-rhel8-ppc64le/Core/
module load spectrum-mpi/10.4-2ycgnlq
module load cuda/11.1-3vj4c72
export KOKKOS_TOOLS_LIBS=$HOME/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so #EDIT THIS PATH
bin=~/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests/ps_combo #EDIT THIS PATH
elements=$((1024*1024))
particleFactor=5
particles=$((elements*particleFactor))
distribution=0 #even
structure=3 #dps
set -x
mpirun --bind-to core -np $SLURM_NPROCS \
  ncu --nvtx --nvtx-include "pseudo push/" \
  --section MemoryWorkloadAnalysis \
  --section SpeedOfLight \
  -o push_SOL_Mem_${particleFactor} \
  $bin --kokkos-map-device-id-by=mpi_rank $elements $particles $distribution $structure
```
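Before submitting, it can help to assemble the `ncu` invocation in a shell array and print it as a dry run, verifying the NVTX filter, section names, and the derived particle count (1024*1024 elements times a factor of 5 gives 5242880 particles). A sketch using the same values as the script above (the binary path is a placeholder):

```shell
bin=./ps_combo                # placeholder; use your real ps_combo path
elements=$((1024*1024))
particleFactor=5
particles=$((elements*particleFactor))

# Build the command as an array so the quoting of "pseudo push/" survives.
cmd=(ncu --nvtx --nvtx-include "pseudo push/"
     --section MemoryWorkloadAnalysis
     --section SpeedOfLight
     -o "push_SOL_Mem_${particleFactor}"
     "$bin" $elements $particles 0 3)

# Print the command instead of running it.
printf '%s ' "${cmd[@]}"; echo
```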
Make the scripts executable:

```
chmod +x *Aimos.sh
```

Submit the job:

```
./submitAimos.sh
```
If all went well, the `slurm-####.out` file should contain output similar to the following.

```
+ mpirun --bind-to core -np 1 ncu --nvtx --nvtx-include 'pseudo push/' --section MemoryWorkloadAnalysis --section SpeedOfLight -o push_SOL_Mem_5 /gpfs/u/home/MPMS/MPMSsmth/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests//ps_combo --kokkos-map-device-id-by=mpi_rank 1048576 5242880 0 3
==PROF== Connected to process 31855 (/gpfs/u/barn/MPMS/MPMSsmth/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests/ps_combo)
-----------------------------------------------------------
KokkosP: NVTX Analyzer Connector (sequence is 0, version: 20211015)
-----------------------------------------------------------
Test Command:
/gpfs/u/home/MPMS/MPMSsmth/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests//ps_combo 1048576 5242880 0 3
Per particle user data size (B): 160
Generating particle distribution with strategy: Evenly
building DPS
Performing 1 iterations of push on each structure
Beginning push on structure
==PROF== Profiling "cuda_parallel_launch_local_me..." - 1: 0%....50%....100% - 20 passes
Performing 1 iterations of migrate/rebuild on each structure
Beginning migrate on structure
Timing Summary 0
Operation                               Total Time   Call Count   Average Time
pseudo-push                             2.11484      1            2.11484
DPS add particles                       4.08e-07     1            4.08e-07
DPS copy particles                      5.14e-07     1            5.14e-07
DPS count/move/delete active particles  0.00167106   1            0.00167106
DPS particle migration                  1.376e-06    1            1.376e-06    Total Prebarrier=2.748e-06
DPS rebuild                             0.00168286   1            0.00168286
redistribute                            0.0307217    1            0.0307217
redistribute processes                  0.000939054  1            0.000939054
-----------------------------------------------------------
KokkosP: Finalization of NVTX Connector. Complete.
-----------------------------------------------------------
==PROF== Disconnected from process 31855
==PROF== Report: /gpfs/u/barn/MPMS/MPMSsmth/pumipicDps/push_SOL_Mem_5.ncu-rep
```
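The report can also be inspected on AiMOS without the GUI: `ncu` can import a `.ncu-rep` file and print its details page on the command line. The sketch below only echoes the command so it can be checked anywhere; drop the `echo` and run it on a node where `ncu` is on the PATH:

```shell
# Name of the report produced by the run above.
report=push_SOL_Mem_5.ncu-rep

# Print the CLI import command; remove the echo to actually run it.
echo ncu --import "$report" --page details
```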
Install Nsight Compute on your local system (a GPU is not required) and run it. On GNU/Linux, it can be launched by running `ncu-ui` on the command line. Once the GUI is open, select the menu item "File->Open File..." and select the `push_SOL_Mem_5.ncu-rep` file.