|
| 1 | +# ROCm™ Systems Profiler aka `rocprof-sys` |
| 2 | +NOTE: Extensive documentation on how to use `rocprof-sys` for the [GhostExchange examples](https://github.com/amd/HPCTrainingExamples/tree/main/MPI-examples/GhostExchange) is also available as `README.md` in this exercises repository. Here, we show how to use the `rocprof-sys` tools with the example in `HPCTrainingExamples/HIP/jacobi`. |
| 3 | + |
| 4 | +In this series of examples, we will demonstrate profiling with `rocprof-sys` on a platform using an AMD Instinct™ MI250X GPU. The ROCm 6.3.2 release includes the `rocprofiler-systems` package, which you can install. |
| 5 | + |
| 6 | +Note that the focus of this exercise is on the `rocprof-sys` profiler, not on how to achieve optimal performance on the MI250X. |
| 7 | + |
| 8 | +First, clone the HPCTrainingExamples repository (loading ROCm is covered in the next section): |
| 9 | + |
| 10 | +``` |
| 11 | +git clone https://github.com/amd/HPCTrainingExamples.git |
| 12 | +``` |
| 13 | + |
| 14 | +## Environment setup |
| 15 | + |
| 16 | +For this training, you need a recent ROCm release (>=6.3), which contains `rocprof-sys`, as well as an MPI installation. |
| 17 | + |
| 18 | +``` |
| 19 | +module load rocm/6.3.2 |
| 20 | +module load openmpi |
| 21 | + |
| 22 | +``` |
| 23 | + |
| 24 | +## Build and run |
| 25 | + |
| 26 | +No profiling yet, just check that the code compiles and runs correctly. |
| 27 | + |
| 28 | +``` |
| 29 | +cd HPCTrainingExamples/HIP/jacobi |
| 30 | +make |
| 31 | +mpirun -np 1 ./Jacobi_hip -g 1 1 |
| 32 | +``` |
| 33 | + |
| 34 | +The above run should show output that looks like this: |
| 35 | + |
| 36 | +``` |
| 37 | +Topology size: 1 x 1 |
| 38 | +Local domain size (current node): 4096 x 4096 |
| 39 | +Global domain size (all nodes): 4096 x 4096 |
| 40 | +Rank 0 selecting device 0 on host TheraC63 |
| 41 | +Starting Jacobi run. |
| 42 | +Iteration: 0 - Residual: 0.022108 |
| 43 | +Iteration: 100 - Residual: 0.000625 |
| 44 | +Iteration: 200 - Residual: 0.000371 |
| 45 | +Iteration: 300 - Residual: 0.000274 |
| 46 | +Iteration: 400 - Residual: 0.000221 |
| 47 | +Iteration: 500 - Residual: 0.000187 |
| 48 | +Iteration: 600 - Residual: 0.000163 |
| 49 | +Iteration: 700 - Residual: 0.000145 |
| 50 | +Iteration: 800 - Residual: 0.000131 |
| 51 | +Iteration: 900 - Residual: 0.000120 |
| 52 | +Iteration: 1000 - Residual: 0.000111 |
| 53 | +Stopped after 1000 iterations with residue 0.000111 |
| 54 | +Total Jacobi run time: 1.2876 sec. |
| 55 | +Measured lattice updates: 13.03 GLU/s (total), 13.03 GLU/s (per process) |
| 56 | +Measured FLOPS: 221.51 GFLOPS (total), 221.51 GFLOPS (per process) |
| 57 | +Measured device bandwidth: 1.25 TB/s (total), 1.25 TB/s (per process) |
| 58 | +``` |
| 59 | + |
| 60 | +## `rocprof-sys` config |
| 61 | + |
| 62 | +First, generate the `rocprof-sys` configuration file, and ensure that this file is known to `rocprof-sys`. |
| 63 | + |
| 64 | +``` |
| 65 | +rocprof-sys-avail -G ~/.rocprofsys.cfg |
| 66 | +export ROCPROFSYS_CONFIG_FILE=~/.rocprofsys.cfg |
| 67 | +``` |
| 68 | + |
| 69 | +Second, inspect the configuration file and change variables as needed. For example, you can modify the following lines: |
| 70 | + |
| 71 | +``` |
| 72 | +ROCPROFSYS_PROFILE = true |
| 73 | +ROCPROFSYS_USE_ROCTX = true |
| 74 | +ROCPROFSYS_SAMPLING_CPUS = 0 |
| 75 | +``` |
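Editing these entries by hand is fine. For scripted setups, the same change can be made with a small bash helper. The following is a minimal sketch, assuming the configuration file keeps one `KEY = value` entry per line, as in the lines above. The function name `set_cfg` is hypothetical and not part of `rocprof-sys`, and the in-place edit relies on GNU `sed`.

```shell
# set_cfg FILE KEY VALUE
# Hypothetical helper (not part of rocprof-sys): set KEY to VALUE in FILE,
# assuming one "KEY = value" entry per line. Requires GNU sed for "-i".
set_cfg() {
    local file="$1" key="$2" value="$3"
    if grep -qE "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        # Key already present: rewrite that line in place.
        sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        # Key missing: append a new entry at the end of the file.
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, `set_cfg ~/.rocprofsys.cfg ROCPROFSYS_PROFILE true` would enable the profile output used later in this exercise. Values containing `|` or `&` are not handled by this sketch.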
| 76 | + |
| 77 | +You can see what flags can be included in the config file by doing: |
| 78 | + |
| 79 | +``` |
| 80 | +rocprof-sys-avail --categories rocprofsys |
| 81 | +``` |
| 82 | + |
| 83 | +To add brief descriptions, use the `-bd` option: |
| 84 | + |
| 85 | +``` |
| 86 | +rocprof-sys-avail -bd --categories rocprofsys |
| 87 | +``` |
| 88 | + |
| 89 | +Note that the list of flags displayed by the commands above may not include all actual flags that can be set in the config. For a full list of options, please read the [rocprof-sys documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html). |
| 90 | + |
| 91 | +You can also create a configuration file that includes a description for each option. Beware, this is quite verbose: |
| 92 | + |
| 93 | +``` |
| 94 | +rocprof-sys-avail -G ~/rocprofsys_all.cfg --all |
| 95 | +``` |
| 96 | + |
| 97 | +## Instrument application binary |
| 98 | + |
| 99 | +You can instrument the binary, and inspect which functions were instrumented (note that you need to change `<TIMESTAMP>` according to your generated folder path). |
| 100 | + |
| 101 | +``` |
| 102 | +rocprof-sys-instrument -o ./Jacobi_hip.inst -- ./Jacobi_hip |
| 103 | +for f in rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/instrumentation/*.txt; do echo "$f"; cat "$f"; echo "##########"; done |
| 104 | +``` |
| 105 | + |
| 106 | +By default, `rocprof-sys` currently instruments only functions with more than 1024 instructions, so you may need to change this threshold using `-i #inst`, or add `--function-include function_name` to select the functions you are interested in. Check more options using `rocprof-sys-instrument --help` or by reading the [rocprof-sys documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html). |
| 107 | + |
| 108 | +Let's instrument the most important Jacobi kernels. |
| 109 | + |
| 110 | +``` |
| 111 | +rocprof-sys-instrument --function-include 'Jacobi_t::Run' 'JacobiIteration' -o ./Jacobi_hip.inst -- ./Jacobi_hip |
| 112 | +``` |
| 113 | + |
| 114 | +The output should show that only these functions have been instrumented: |
| 115 | + |
| 116 | +``` |
| 117 | +... |
| 118 | +[rocprof-sys][exe] Finding instrumentation functions... |
| 119 | +[rocprof-sys][exe] 1 instrumented funcs in JacobiIteration.hip |
| 120 | +[rocprof-sys][exe] 1 instrumented funcs in JacobiRun.hip |
| 121 | +[rocprof-sys][exe] 1 instrumented funcs in Jacobi_hip |
| 122 | +... |
| 123 | +``` |
| 124 | + |
| 125 | +This can also be verified with: |
| 126 | + |
| 127 | +``` |
| 128 | +$ cat rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/instrumentation/instrumented.txt |
| 129 | + |
| 130 | + StartAddress AddressRange #Instructions Ratio Linkage Visibility Module Function FunctionSignature |
| 131 | + 0x226440 332 71 4.68 unknown unknown JacobiIteration.hip JacobiIteration JacobiIteration |
| 132 | + 0x224ad0 677 146 4.64 unknown unknown JacobiRun.hip Jacobi_t::Run Jacobi_t::Run |
| 133 | + 0x226370 205 38 5.39 unknown unknown Jacobi_hip __device_stub__JacobiIterationKernel __device_stub__JacobiIterationKernel |
| 134 | +``` |
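Substituting `<TIMESTAMP>` by hand gets tedious after a few runs. The following bash sketch picks the most recently modified subdirectory of an output folder instead. It assumes that each run writes into its own timestamped subdirectory, as in the paths above. The function name `latest_output_dir` is hypothetical and not part of `rocprof-sys`.

```shell
# latest_output_dir BASE
# Hypothetical helper (not part of rocprof-sys): print the most recently
# modified subdirectory of BASE, e.g. BASE=rocprofsys-Jacobi_hip.inst-output.
# Returns non-zero if BASE has no subdirectories.
latest_output_dir() {
    local base="$1" newest="" d
    for d in "$base"/*/; do
        [ -d "$d" ] || continue
        # Keep the candidate with the newest modification time.
        if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
            newest="$d"
        fi
    done
    [ -n "$newest" ] || return 1
    # Strip the trailing slash left over from the glob pattern.
    printf '%s\n' "${newest%/}"
}
```

With this in place, `cat "$(latest_output_dir rocprofsys-Jacobi_hip.inst-output)/instrumentation/instrumented.txt"` would show the list from the latest instrumentation step.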
| 135 | + |
| 136 | +## Run instrumented binary |
| 137 | + |
| 138 | +Now that we have a new application binary where the most important functions are instrumented, we can profile it using `rocprof-sys-run` under the `mpirun` environment. |
| 139 | + |
| 140 | +``` |
| 141 | +mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1 |
| 142 | +``` |
| 143 | + |
| 144 | +Check the command line output generated by `rocprof-sys-run`; it contains some useful overviews and **paths to generated files**. Observe that the overhead added to the application runtime is small. If you had previously set `ROCPROFSYS_PROFILE=true`, inspect `wall_clock-0.txt`, which includes information on the function calls made in the code, such as how many times each function was called (`COUNT`) and the total time in seconds it took (`SUM`). |
| 145 | + |
| 146 | +**In many cases, simply checking the wall_clock files might be sufficient for your profiling!** |
| 147 | + |
| 148 | +If it is not, continue by visualizing the trace. |
| 149 | + |
| 150 | +## Visualizing traces using `Perfetto` |
| 151 | + |
| 152 | +Copy the generated `perfetto-trace-0.proto` file to your local machine and, using the Chrome browser, open the web page [https://ui.perfetto.dev/](https://ui.perfetto.dev/). |
| 153 | + |
| 154 | +Click `Open trace file` and select the `perfetto-trace-0.proto` file. Below, you can see an example of how the trace file would be visualized on `Perfetto`: |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | + |
| 159 | +If there is an error opening the trace file, try using an older `Perfetto` version, e.g., by opening the web page [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/). |
| 160 | + |
| 161 | +## Additional features |
| 162 | +### Flat profiles |
| 163 | + |
| 164 | +Append the advanced option `ROCPROFSYS_FLAT_PROFILE=true` to `~/.rocprofsys.cfg` or prepend it to the `mpirun` command: |
| 165 | + |
| 166 | +``` |
| 167 | +ROCPROFSYS_FLAT_PROFILE=true mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1 |
| 168 | +``` |
| 169 | + |
| 170 | +The `wall_clock-0.txt` file now shows the overall time in seconds for each function. |
| 171 | + |
| 172 | +Note the significant total execution time for `hipMemcpy` and `Jacobi_t::Run` calls. |
| 173 | + |
| 174 | +### Hardware counters |
| 175 | + |
| 176 | +To see a list of all the counters for all the devices on the node, do: |
| 177 | + |
| 178 | +``` |
| 179 | +rocprof-sys-avail --all |
| 180 | +``` |
| 181 | + |
| 182 | +Select the counters you are interested in, and then declare them in your configuration file (or prepend them to your `mpirun` command): |
| 183 | + |
| 184 | +``` |
| 185 | +ROCPROFSYS_ROCM_EVENTS = VALUUtilization,FetchSize |
| 186 | +``` |
| 187 | + |
| 188 | +Run the instrumented binary, and you will observe an output file for each hardware counter specified. You should also see a row for each hardware counter in the `Perfetto` trace generated by `rocprof-sys`. |
| 189 | + |
| 190 | +Note that you do not have to instrument again after making changes to the config file. Just running the instrumented binary picks up the changes. |
| 191 | + |
| 192 | +``` |
| 193 | +ROCPROFSYS_ROCM_EVENTS=VALUUtilization,FetchSize mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1 |
| 194 | +cat rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.txt |
| 195 | +``` |
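When several counters are requested, it can help to list the output file for each of them in one go. The bash sketch below assumes that the per-counter files follow the naming seen in the `cat` command above (`rocprof-device-<device>-<counter>-<rank>.txt`); this pattern is inferred from that single example and may differ between versions. The function name `counter_files` is hypothetical and not part of `rocprof-sys`.

```shell
# counter_files DIR EVENTS
# Hypothetical helper (not part of rocprof-sys): for a comma-separated list
# of counter names (same format as ROCPROFSYS_ROCM_EVENTS), print the
# matching per-counter output files found in DIR.
# Assumed file naming: rocprof-device-<device>-<counter>-<rank>.txt
counter_files() {
    local dir="$1" events="$2" name f
    local IFS=','            # split the event list on commas
    for name in $events; do
        for f in "$dir"/rocprof-device-*-"$name"-*.txt; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}
```

For example, `counter_files rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP> VALUUtilization,FetchSize` would print one path per counter and device, which can then be passed to `cat` or `less`.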
| 196 | + |
| 197 | +### Sampling |
| 198 | + |
| 199 | +To reduce the overhead of profiling, one can use call stack sampling. Set the following in your configuration file (or prepend to your `mpirun` command): |
| 200 | + |
| 201 | +``` |
| 202 | +ROCPROFSYS_USE_SAMPLING = true |
| 203 | +ROCPROFSYS_SAMPLING_FREQ = 100 |
| 204 | +``` |
| 205 | + |
| 206 | +Execute the instrumented binary, inspect `sampling*` files and visualize the `Perfetto` trace: |
| 207 | + |
| 208 | +``` |
| 209 | +mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1 |
| 210 | +ls rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/* | grep sampling |
| 211 | +``` |
| 212 | + |
| 213 | +### Profiling multiple MPI processes |
| 214 | + |
| 215 | +Run the instrumented binary with multiple MPI ranks. Note that separate output files are generated for each rank, including `perfetto-trace-*.proto` and `wall_clock-*.txt` files. |
| 216 | + |
| 217 | +``` |
| 218 | +mpirun -np 2 rocprof-sys-run -- ./Jacobi_hip.inst -g 2 1 |
| 219 | +``` |
| 220 | + |
| 221 | +Inspect the output text files. Then visualize the `perfetto-trace-*.proto` files in `Perfetto`. Note that you can merge multiple trace files into a single one using simple concatenation: |
| 222 | + |
| 223 | +``` |
| 224 | +cat perfetto-trace-*.proto > merged.proto |
| 225 | +``` |
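To inspect the per-rank text output mentioned above without opening each file separately, a short bash loop is enough. The following is a minimal sketch that prints the beginning of every `wall_clock-*.txt` file in an output directory. It makes no assumption about the internal layout of these files. The function name `show_rank_profiles` is hypothetical and not part of `rocprof-sys`.

```shell
# show_rank_profiles DIR [LINES]
# Hypothetical helper (not part of rocprof-sys): print the first LINES lines
# (default 20) of every wall_clock-<rank>.txt file found in DIR.
# Returns non-zero if no such file exists.
show_rank_profiles() {
    local dir="$1" lines="${2:-20}" f found=0
    for f in "$dir"/wall_clock-*.txt; do
        [ -f "$f" ] || continue
        found=1
        # Header line so the ranks can be told apart in the output.
        printf '===== %s =====\n' "$(basename "$f")"
        head -n "$lines" "$f"
    done
    [ "$found" -eq 1 ]
}
```

For a two-rank run, `show_rank_profiles rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>` would print the start of `wall_clock-0.txt` followed by `wall_clock-1.txt`.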
| 226 | + |
| 227 | +## Next steps |
| 228 | + |
| 229 | +Try using `rocprof-sys` to profile the [GhostExchange examples](https://github.com/amd/HPCTrainingExamples/tree/main/MPI-examples/GhostExchange). |
| 230 | + |
| 231 | +**Finally, try to profile your own application!** |