Profiling with Nsight Compute

These instructions describe how to run the NVIDIA Nsight Compute profiling tool on AiMOS.

Unless otherwise noted, all commands were executed on an AiMOS front end node.

In the description below, application refers to the code being run under Nsight Compute to understand its performance.

Nsight Compute has a graphical interface, ncu-ui, which runs on your local system and displays performance information for the application being tested, and a command line interface, ncu, which runs on the compute nodes of the system where the application runs (i.e., AiMOS).
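
At a high level the workflow is: run ncu on a compute node to write a report file, copy that file to your local system, and open it with ncu-ui. A minimal sketch, where myApp is a hypothetical CUDA executable standing in for your application:

# on a compute node: profile myApp and write myReport.ncu-rep
ncu -o myReport ./myApp
# on your local system: open the report in the GUI
ncu-ui myReport.ncu-rep

The remainder of this page walks through this workflow for a pumipic test application.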

Documentation

Kokkos Lecture on Tools (skip to slide 27): https://raw.githubusercontent.com/kokkos/kokkos-tutorials/main/LectureSeries/KokkosTutorial_07_Tools.pdf

Manual: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html

Guided Analysis Video: https://www.youtube.com/watch?v=04dJ-aePYpE

NERSC Tutorial on Perlmutter: https://www.nersc.gov/users/training/events/nsight-systems-and-nsight-compute-profiling-workshop-aug2022/

  • Note: sections of this tutorial provide Perlmutter-specific instructions that do not apply to AiMOS

Set Up the Environment

Copied from https://github.com/SCOREC/pumi-pic/wiki/Building-and-Running-on-AiMOS-RedHat-8#environment-script

module use /gpfs/u/software/dcs-rhel8-spack-install/v0162gccSpectrum/lmod/linux-rhel8-ppc64le/Core/
module load spectrum-mpi/10.4-2ycgnlq
module load cmake/3.20.0/1
module load netcdf-cxx4/4.3.1-gdysz4t
module load cuda/11.1-3vj4c72

export root=$PWD

export OMPI_CXX=$root/kokkos/bin/nvcc_wrapper
export OMPI_CC=gcc
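
To sanity-check the environment, confirm that the CUDA compiler resolves to the loaded module; this is only a quick verification step, not part of the original recipe:

which nvcc      # should point into the cuda/11.1 module installation
nvcc --version  # should report CUDA 11.1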

Build Kokkos-Tools

The nvprof-connector is required so that the kernels being profiled appear with readable names in the report.

cd $root
git clone -b master https://github.com/kokkos/kokkos-tools
cd kokkos-tools/profiling/nvprof-connector
make 
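
The build should leave a shared library in the connector directory; this is the same file referenced by KOKKOS_TOOLS_LIBS in the run script below. A quick check:

ls $root/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so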

Create SLURM Scripts

Create a file named submitAimos.sh with the following contents.

sbatch -N 1 -n 1 -t 10 -p el8-rpi --gres=gpu:16g:4 ./runAimos.sh

Create a file named runAimos.sh with the following contents. Note: this is set up to collect memory and compute performance information on the pseudo push kernel (the trailing forward slash is required) of the ps_combo test application defined in the pumipic cws/dpsTesting branch (as of 7a71080).

The paths to kp_nvprof_connector.so and ps_combo must be changed before running.

#!/bin/bash

module use /gpfs/u/software/dcs-rhel8-spack-install/v0162gccSpectrum/lmod/linux-rhel8-ppc64le/Core/
module load spectrum-mpi/10.4-2ycgnlq
module load cuda/11.1-3vj4c72

export KOKKOS_TOOLS_LIBS=$HOME/kokkos-tools/profiling/nvprof-connector/kp_nvprof_connector.so #EDIT THIS PATH

bin=~/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests/ps_combo #EDIT THIS PATH
elements=$((1024*1024))
particleFactor=5
particles=$((elements*particleFactor))
distribution=0 #even
structure=3 #dps
set -x
mpirun --bind-to core -np $SLURM_NPROCS \
  ncu --nvtx --nvtx-include "pseudo push/" \
  --section MemoryWorkloadAnalysis \
  --section SpeedOfLight \
  -o push_SOL_Mem_${particleFactor} \
  $bin --kokkos-map-device-id-by=mpi_rank $elements $particles $distribution $structure
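
The two --section flags restrict collection to the memory workload analysis and speed-of-light sections, which keeps the number of kernel replay passes down. Other sections can be requested the same way; to see what is available, run:

ncu --list-sections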

Make the scripts executable:

chmod +x *Aimos.sh

Submit the Job

./submitAimos.sh
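
Standard SLURM commands can be used to monitor the job while it runs, for example:

squeue -u $USER   # list your queued and running jobs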

If all went well, the slurm-####.out file should contain output similar to the following.

+ mpirun --bind-to core -np 1 ncu --nvtx --nvtx-include 'pseudo push/' --section MemoryWorkloadAnalysis --section SpeedOfLight -o push_SOL_Mem_5 /gpfs/u/home/MPMS/MPMSsmth/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests//ps_combo --kokkos-map-device-id-by=mpi_rank 1048576 5242880 0 3
==PROF== Connected to process 31855 (/gpfs/u/barn/MPMS/MPMSsmth/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests/ps_combo)
-----------------------------------------------------------
KokkosP: NVTX Analyzer Connector (sequence is 0, version: 20211015)
-----------------------------------------------------------
Test Command:
 /gpfs/u/home/MPMS/MPMSsmth/barn/pumipicDps/build-dcsRhel8-gcc841-pumipic/performance_tests//ps_combo 1048576 5242880 0 3 
Per particle user data size (B): 160 
Generating particle distribution with strategy: Evenly
building DPS 
Performing 1 iterations of push on each structure
Beginning push on structure 
==PROF== Profiling "cuda_parallel_launch_local_me..." - 1: 0%....50%....100% - 20 passes
Performing 1 iterations of migrate/rebuild on each structure
Beginning migrate on structure 
Timing Summary 0
Operation                                Total Time   Call Count   Average Time
 pseudo-push                                   2.11484            1        2.11484
DPS add particles                             4.08e-07            1       4.08e-07
DPS copy particles                            5.14e-07            1       5.14e-07
DPS count/move/delete active particles      0.00167106            1     0.00167106
DPS particle migration                       1.376e-06            1      1.376e-06  Total Prebarrier=2.748e-06
DPS rebuild                                 0.00168286            1     0.00168286
redistribute                                 0.0307217            1      0.0307217
redistribute processes                     0.000939054            1    0.000939054

-----------------------------------------------------------
KokkosP: Finalization of NVTX Connector. Complete.
-----------------------------------------------------------
==PROF== Disconnected from process 31855
==PROF== Report: /gpfs/u/barn/MPMS/MPMSsmth/pumipicDps/push_SOL_Mem_5.ncu-rep
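
The report file named on the ==PROF== Report line must be copied to your local system before it can be opened in the GUI. A sketch using scp, where <user> and <aimos-login-node> are placeholders for your account and the AiMOS login node:

scp <user>@<aimos-login-node>:<path to push_SOL_Mem_5.ncu-rep> .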

Load Report Files into the Nsight Compute GUI

Install Nsight Compute on your local system (a GPU is not required) and run it. On GNU/Linux, it can be launched by running ncu-ui on the command line.

Once the GUI is open, select the menu item "File->Open File..." and choose the push_SOL_Mem_5.ncu-rep file.
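
The report file can also be passed to ncu-ui directly on the command line, or summarized in a terminal (e.g., back on AiMOS) by importing it with ncu:

ncu --import push_SOL_Mem_5.ncu-rep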