Commit 8484627

adding rocprofiler-systems jacobi README based on Omnitrace (#93)
* adding rocprofiler-systems jacobi README based on Omnitrace
* addressed all Gina's comments
1 parent c0222b6 commit 8484627

1 file changed

+231
-0
lines changed

Diff for: rocprofiler-systems/Jacobi/README.md

@@ -0,0 +1,231 @@
1+
# ROCm™ Systems Profiler aka `rocprof-sys`
2+
NOTE: Extensive documentation on using `rocprof-sys` with the [GhostExchange examples](https://github.com/amd/HPCTrainingExamples/tree/main/MPI-examples/GhostExchange) is also available as `README.md` in this exercises repo. Here, we show how to use the `rocprof-sys` tools with the example in HPCTrainingExamples/HIP/jacobi.
3+
4+
In this series of examples, we will demonstrate profiling with `rocprof-sys` on a platform using an AMD Instinct™ MI250X GPU. The ROCm 6.3.2 release includes the `rocprofiler-systems` package, which you can install.
5+
6+
Note that the focus of this exercise is on the `rocprof-sys` profiler, not on how to achieve optimal performance on the MI250X.
7+
8+
First, clone the HPCTrainingExamples repository (loading ROCm is covered in the next section):
9+
10+
```
11+
git clone https://github.com/amd/HPCTrainingExamples.git
12+
```
13+
14+
## Environment setup
15+
16+
This training requires a recent ROCm release (>= 6.3), which includes `rocprof-sys`, as well as an MPI installation.
17+
18+
```
19+
module load rocm/6.3.2
20+
module load openmpi
21+
22+
```
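Before building, it can help to confirm that the loaded modules actually put the expected tools on `PATH`. The following is a minimal sketch: `check_tools` is a hypothetical helper written for this README, not part of ROCm, and the command names passed to it are simply the ones used later in this document.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: <name>" line per absent command and
# returns non-zero if at least one command is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return $rc
}

# Commands used in the rest of this README; adjust to your installation.
check_tools make mpirun rocprof-sys-avail rocprof-sys-instrument rocprof-sys-run \
    || echo "Load the ROCm and MPI modules before continuing."
```

If nothing is printed, all listed commands were found.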
23+
24+
## Build and run
25+
26+
No profiling yet, just check that the code compiles and runs correctly.
27+
28+
```
29+
cd HPCTrainingExamples/HIP/jacobi
30+
make
31+
mpirun -np 1 ./Jacobi_hip -g 1 1
32+
```
33+
34+
The above run should show output that looks like this:
35+
36+
```
37+
Topology size: 1 x 1
38+
Local domain size (current node): 4096 x 4096
39+
Global domain size (all nodes): 4096 x 4096
40+
Rank 0 selecting device 0 on host TheraC63
41+
Starting Jacobi run.
42+
Iteration: 0 - Residual: 0.022108
43+
Iteration: 100 - Residual: 0.000625
44+
Iteration: 200 - Residual: 0.000371
45+
Iteration: 300 - Residual: 0.000274
46+
Iteration: 400 - Residual: 0.000221
47+
Iteration: 500 - Residual: 0.000187
48+
Iteration: 600 - Residual: 0.000163
49+
Iteration: 700 - Residual: 0.000145
50+
Iteration: 800 - Residual: 0.000131
51+
Iteration: 900 - Residual: 0.000120
52+
Iteration: 1000 - Residual: 0.000111
53+
Stopped after 1000 iterations with residue 0.000111
54+
Total Jacobi run time: 1.2876 sec.
55+
Measured lattice updates: 13.03 GLU/s (total), 13.03 GLU/s (per process)
56+
Measured FLOPS: 221.51 GFLOPS (total), 221.51 GFLOPS (per process)
57+
Measured device bandwidth: 1.25 TB/s (total), 1.25 TB/s (per process)
58+
```
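To compare against profiled runs later, it is convenient to record the baseline run time. A minimal sketch, assuming the output format shown above (in particular the `Total Jacobi run time:` line); `jacobi_runtime` is a hypothetical helper, and the demo input is copied from the sample output:

```shell
# Print the total run time in seconds from Jacobi output read on stdin.
# Assumes a line of the form "Total Jacobi run time: <seconds> sec."
jacobi_runtime() {
    awk '/^Total Jacobi run time:/ { print $5 }'
}

# Demo on two lines copied from the sample output above.
jacobi_runtime <<'EOF'
Stopped after 1000 iterations with residue 0.000111
Total Jacobi run time: 1.2876 sec.
EOF
```

With a real run, one could pipe the program output through `tee baseline.log` and then call `jacobi_runtime < baseline.log`.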
59+
60+
## `rocprof-sys` config
61+
62+
First, generate the `rocprof-sys` configuration file, and ensure that this file is known to `rocprof-sys`.
63+
64+
```
65+
rocprof-sys-avail -G ~/.rocprofsys.cfg
66+
export ROCPROFSYS_CONFIG_FILE=~/.rocprofsys.cfg
67+
```
68+
69+
Second, inspect the configuration file and change variables as needed. For example, one can modify the following lines:
70+
71+
```
72+
ROCPROFSYS_PROFILE = true
73+
ROCPROFSYS_USE_ROCTX = true
74+
ROCPROFSYS_SAMPLING_CPUS = 0
75+
```
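Editing the file by hand works fine; for scripted setups, the same change can be sketched in shell. This assumes the `KEY = VALUE` line format shown above. `set_cfg` is a hypothetical helper, not a `rocprof-sys` tool, and the demo works on a scratch file rather than on `~/.rocprofsys.cfg`.

```shell
# Set KEY to VALUE in a "KEY = VALUE" style file; append the line if KEY is absent.
# Assumes keys contain only letters, digits and underscores, and values contain no "|".
set_cfg() {
    key=$1
    value=$2
    file=$3
    if grep -q "^${key}[[:space:]]*=" "$file"; then
        # Rewrite through a temporary file (portable alternative to sed -i).
        tmp=$(mktemp)
        sed "s|^${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Demo on a scratch file so the real configuration is left untouched.
cfg=$(mktemp)
printf 'ROCPROFSYS_PROFILE = false\n' > "$cfg"
set_cfg ROCPROFSYS_PROFILE true "$cfg"     # existing key: replaced
set_cfg ROCPROFSYS_USE_ROCTX true "$cfg"   # missing key: appended
cat "$cfg"
rm -f "$cfg"
```

To apply it for real, pass the path stored in `ROCPROFSYS_CONFIG_FILE` as the third argument.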
76+
77+
You can see what flags can be included in the config file by doing:
78+
79+
```
80+
rocprof-sys-avail --categories rocprofsys
81+
```
82+
83+
To add brief descriptions, use the `-bd` option:
84+
85+
```
86+
rocprof-sys-avail -bd --categories rocprofsys
87+
```
88+
89+
Note that the list of flags displayed by the commands above may not include all actual flags that can be set in the config. For a full list of options, please read the [rocprof-sys documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html).
90+
91+
You can also create a configuration file that includes a description for each option. Beware, this is quite verbose:
92+
93+
```
94+
rocprof-sys-avail -G ~/rocprofsys_all.cfg --all
95+
```
96+
97+
## Instrument application binary
98+
99+
You can instrument the binary, and inspect which functions were instrumented (note that you need to change `<TIMESTAMP>` according to your generated folder path).
100+
101+
```
102+
rocprof-sys-instrument -o ./Jacobi_hip.inst -- ./Jacobi_hip
103+
for f in rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/instrumentation/*.txt; do echo "$f"; cat "$f"; echo "##########"; done
104+
```
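Since `<TIMESTAMP>` changes with every run, it can be handy to resolve the newest output folder automatically. The sketch below picks the most recently modified subdirectory; `latest_run_dir` is a hypothetical helper, and it assumes that the newest subdirectory is the one from the latest run and that directory names contain no newlines.

```shell
# Print the most recently modified subdirectory of BASE (no trailing slash).
# Returns non-zero if BASE has no subdirectories.
latest_run_dir() {
    newest=$(ls -1dt "$1"/*/ 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        return 1
    fi
    printf '%s\n' "${newest%/}"
}

# Use it only if an output directory from a previous step exists.
if run_dir=$(latest_run_dir rocprofsys-Jacobi_hip.inst-output); then
    ls "$run_dir"/instrumentation
else
    echo "no output directory found yet"
fi
```

The resulting `$run_dir` can then stand in for `rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>` in the commands of this README.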
105+
106+
Currently, `rocprof-sys` by default instruments only functions with more than 1024 instructions, so you may need to lower this threshold with `-i #inst` or select the functions you are interested in with `--function-include function_name`. See `rocprof-sys-instrument --help` or the [rocprof-sys documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html) for more options.
107+
108+
Let's instrument the most important Jacobi functions.
109+
110+
```
111+
rocprof-sys-instrument --function-include 'Jacobi_t::Run' 'JacobiIteration' -o ./Jacobi_hip.inst -- ./Jacobi_hip
112+
```
113+
114+
The output should show that only these functions have been instrumented:
115+
116+
```
117+
...
118+
[rocprof-sys][exe] Finding instrumentation functions...
119+
[rocprof-sys][exe] 1 instrumented funcs in JacobiIteration.hip
120+
[rocprof-sys][exe] 1 instrumented funcs in JacobiRun.hip
121+
[rocprof-sys][exe] 1 instrumented funcs in Jacobi_hip
122+
...
123+
```
124+
125+
This can also be verified with:
126+
127+
```
128+
$ cat rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/instrumentation/instrumented.txt
129+
130+
StartAddress AddressRange #Instructions Ratio Linkage Visibility Module Function FunctionSignature
131+
0x226440 332 71 4.68 unknown unknown JacobiIteration.hip JacobiIteration JacobiIteration
132+
0x224ad0 677 146 4.64 unknown unknown JacobiRun.hip Jacobi_t::Run Jacobi_t::Run
133+
0x226370 205 38 5.39 unknown unknown Jacobi_hip __device_stub__JacobiIterationKernel __device_stub__JacobiIterationKernel
134+
```
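The listing can also be condensed to just function names and instruction counts, which makes it easy to see why a function fell below the default instruction threshold. A minimal sketch, assuming whitespace-separated columns exactly as in the excerpt above (the real file layout may differ); `list_instrumented` is a hypothetical helper and the demo rows are copied from that excerpt:

```shell
# Print "<function> <instruction count>" for every data row read on stdin.
# Assumes whitespace-separated columns with #Instructions in column 3 and
# Function in column 8, and function names without spaces.
list_instrumented() {
    awk '$1 ~ /^0x/ { print $8, $3 }'
}

# Demo on rows copied from the excerpt above; the header line is skipped
# because its first field is not a hexadecimal address.
list_instrumented <<'EOF'
StartAddress AddressRange #Instructions Ratio Linkage Visibility Module Function FunctionSignature
0x226440 332 71 4.68 unknown unknown JacobiIteration.hip JacobiIteration JacobiIteration
0x224ad0 677 146 4.64 unknown unknown JacobiRun.hip Jacobi_t::Run Jacobi_t::Run
EOF
```

On a real run, feed it the file directly, for example `list_instrumented < .../instrumentation/instrumented.txt`.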
135+
136+
## Run instrumented binary
137+
138+
Now that we have a new application binary where the most important functions are instrumented, we can profile it using `rocprof-sys-run` under the `mpirun` environment.
139+
140+
```
141+
mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1
142+
```
143+
144+
Check the command line output generated by `rocprof-sys-run`: it contains some useful overviews and **paths to generated files**. Observe that the overhead added to the application runtime is small. If you previously set `ROCPROFSYS_PROFILE=true`, inspect `wall_clock-0.txt`, which lists the functions called in the code, including how many times each was called (`COUNT`) and the total time in seconds it took (`SUM`).
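To put a number on the runtime overhead, compare the `Total Jacobi run time` of the instrumented run against the uninstrumented baseline. The sketch below only does the arithmetic; `overhead_pct` is a hypothetical helper, and the instrumented time used in the demo is a made-up placeholder, not a measurement.

```shell
# Relative overhead in percent of an instrumented run over a baseline run.
# Arguments: <baseline seconds> <instrumented seconds>
overhead_pct() {
    awk -v base="$1" -v inst="$2" 'BEGIN { printf "%.1f\n", (inst - base) / base * 100 }'
}

# 1.2876 s is the baseline from the sample output earlier in this README;
# 1.3500 s is a placeholder standing in for your instrumented run time.
overhead_pct 1.2876 1.3500
```

Replace both arguments with the times reported by your own runs.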
145+
146+
**In many cases, simply checking the wall_clock files might be sufficient for your profiling!**
147+
148+
If it is not, continue by visualizing the trace.
149+
150+
## Visualizing traces using `Perfetto`
151+
152+
Copy the generated `perfetto-trace-0.proto` file to your local machine and, using the Chrome browser, open the web page [https://ui.perfetto.dev/](https://ui.perfetto.dev/).
153+
154+
Click `Open trace file` and select the `perfetto-trace-0.proto` file. Below, you can see an example of how the trace file would be visualized on `Perfetto`:
155+
156+
![jacobi_hip-perfetto_screenshot](https://hackmd.io/_uploads/BkgSH-E0A.png)
157+
158+
159+
If there is an error opening the trace file, try using an older `Perfetto` version, e.g., by opening the web page [https://ui.perfetto.dev/v46.0-35b3d9845/#!/](https://ui.perfetto.dev/v46.0-35b3d9845/#!/).
160+
161+
## Additional features
162+
### Flat profiles
163+
164+
Append the advanced option `ROCPROFSYS_FLAT_PROFILE=true` to `~/.rocprofsys.cfg`, or prepend it to the `mpirun` command:
165+
166+
```
167+
ROCPROFSYS_FLAT_PROFILE=true mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1
168+
```
169+
170+
The `wall_clock-0.txt` file now shows the overall time in seconds for each function.
171+
172+
Note the significant total execution time for `hipMemcpy` and `Jacobi_t::Run` calls.
173+
174+
### Hardware counters
175+
176+
To see a list of all the counters for all the devices on the node, do:
177+
178+
```
179+
rocprof-sys-avail --all
180+
```
181+
182+
Select the counters you are interested in, and then declare them in your configuration file (or prepend them to your `mpirun` command):
183+
184+
```
185+
ROCPROFSYS_ROCM_EVENTS = VALUUtilization,FetchSize
186+
```
187+
188+
Run the instrumented binary, and you will observe an output file for each hardware counter specified. You should also see a row for each hardware counter in the `Perfetto` trace generated by `rocprof-sys`.
189+
190+
Note that you do not have to instrument again after making changes to the config file. Just running the instrumented binary picks up the changes.
191+
192+
```
193+
ROCPROFSYS_ROCM_EVENTS=VALUUtilization,FetchSize mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1
194+
cat rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/rocprof-device-0-VALUUtilization-0.txt
195+
```
196+
197+
### Sampling
198+
199+
To reduce the overhead of profiling, one can use call stack sampling. Set the following in your configuration file (or prepend to your `mpirun` command):
200+
201+
```
202+
ROCPROFSYS_USE_SAMPLING = true
203+
ROCPROFSYS_SAMPLING_FREQ = 100
204+
```
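As a rough sanity check on these settings, one can estimate how many samples to expect. The sketch below assumes that `ROCPROFSYS_SAMPLING_FREQ` is the number of sampling interrupts per second and that the count applies per sampled thread; please confirm both against the `rocprof-sys` documentation. `expected_samples` is a hypothetical helper.

```shell
# Estimate the number of samples for a given frequency (Hz) and run time (s).
# Arguments: <samples per second> <run time in seconds>
expected_samples() {
    awk -v freq="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", freq * secs }'
}

# 100 Hz as configured above, and the 1.2876 s run time from the sample output.
expected_samples 100 1.2876
```

For a run this short, a frequency of 100 gives only on the order of a hundred samples, so short-lived functions may not show up at all.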
205+
206+
Execute the instrumented binary, inspect `sampling*` files and visualize the `Perfetto` trace:
207+
208+
```
209+
mpirun -np 1 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1
210+
ls rocprofsys-Jacobi_hip.inst-output/<TIMESTAMP>/* | grep sampling
211+
```
212+
213+
### Profiling multiple MPI processes
214+
215+
Run the instrumented binary with multiple MPI ranks. Note the separate output files for each rank, including the `perfetto-trace-*.proto` and `wall_clock-*.txt` files.
216+
217+
```
218+
mpirun -np 2 rocprof-sys-run -- ./Jacobi_hip.inst -g 2 1
219+
```
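A quick way to confirm that every rank produced a trace is to count the per-rank files. A minimal sketch, assuming the `perfetto-trace-<rank>.proto` naming mentioned above; `count_traces` is a hypothetical helper, and the directory argument should point at the timestamped output folder of your run.

```shell
# Count files in DIR (default: current directory) matching perfetto-trace-*.proto.
count_traces() {
    n=0
    for f in "${1:-.}"/perfetto-trace-*.proto; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Compare against the number of ranks passed to mpirun (-np 2 above).
ranks=2
found=$(count_traces .)
if [ "$found" -ne "$ranks" ]; then
    echo "expected $ranks trace files, found $found"
fi
```

Run it from (or point it at) the output folder; no message means one trace file per rank was found.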
220+
221+
Inspect the output text files. Then visualize the `perfetto-trace-*.proto` files in `Perfetto`. Note that multiple trace files can be merged into a single one by simple concatenation:
222+
223+
```
224+
cat perfetto-trace-*.proto > merged.proto
225+
```
226+
227+
## Next steps
228+
229+
Try to use `rocprof-sys` to profile [GhostExchange examples](https://github.com/amd/HPCTrainingExamples/tree/main/MPI-examples/GhostExchange).
230+
231+
**Finally, try to profile your own application!**
