Reduce number of calls to rocprof #384

benrichard-amd · 2024-07-15T23:48:19Z

Reduces number of calls to rocprof by improving perfmon coalescing.
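
As a rough illustration of the idea (not the code in this PR), coalescing can be thought of as packing the requested counters into as few profiling passes as possible, subject to a limit on how many counters each hardware block can collect at once. The sketch below uses a greedy first-fit strategy. The `coalesce` helper, the counter names, and the per-block limits are all hypothetical and chosen only for illustration.

```python
def coalesce(counters, limits):
    """Greedily pack (block, counter) pairs into profiling passes.

    counters: list of (block, name) tuples.
    limits:   dict mapping block -> max counters per pass (assumed values).
    Returns a list of passes; each pass maps block -> list of counter names.
    """
    passes = []
    for block, name in counters:
        for p in passes:
            # Reuse the first existing pass that still has room for this block.
            if len(p.get(block, [])) < limits[block]:
                p.setdefault(block, []).append(name)
                break
        else:
            # No existing pass has room, so open a new one.
            passes.append({block: [name]})
    return passes


# Placeholder counters and limits, not real hardware values.
requested = [("SQ", "SQ_A"), ("SQ", "SQ_B"), ("SQ", "SQ_C"),
             ("TCC", "TCC_X"), ("TCC", "TCC_Y")]
passes = coalesce(requested, {"SQ": 2, "TCC": 1})
print(len(passes))
```

Each pass corresponds to one application replay, so fewer passes means fewer calls to rocprof.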

coleramos425 · 2024-07-16T18:02:21Z

I reviewed the PR and, in general, things look good. The one thing I wanted to review closely was the set of Omniperf metrics that explicitly depend on SQ_*.csv (indicated by the coll_level field in the yaml configs), to ensure the results still match the original implementation, i.e.

See: Code search results (github.com)

Things line up for the most part; however, when contrasting the two, I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that the counter <-> output file mapping stays the same, in which case I think we can attribute this to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:

v1 (original code vs. original code)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/orig_run2/MI300A/ -b 11.2.10 13.2.5

v2 (original code vs. ben's modification)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/bens_run1/MI300A/ -b 11.2.10 13.2.5
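
The reasoning behind the two comparisons can be sketched as follows: v1 measures the noise between two runs of the same code, and v2 is only suspicious if its difference clearly exceeds that noise. This is a minimal sketch, not part of Omniperf. The `rel_diff` and `within_noise` helpers, the `slack` factor, and the numeric values are hypothetical.

```python
def rel_diff(a, b):
    """Relative difference of b with respect to a (a must be nonzero)."""
    return abs(b - a) / abs(a)


def within_noise(orig_run1, orig_run2, new_run, slack=1.5):
    """True if new_run differs from orig_run1 by no more than
    slack times the difference observed between the two original runs."""
    noise = rel_diff(orig_run1, orig_run2)   # v1: original vs. original
    change = rel_diff(orig_run1, new_run)    # v2: original vs. modified
    return change <= slack * noise


# Placeholder metric values, not measurements from this PR.
print(within_noise(100.0, 110.0, 108.0))
print(within_noise(100.0, 101.0, 120.0))
```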

coleramos425 · 2024-07-16T18:05:34Z

Other than the above inquiry my only other request before we merge this is that you sign off on commits per DCO requirements, e.g.

benrichard-amd · 2024-07-17T19:51:08Z

Hi @coleramos425,

> Things line up for the most part; however, when contrasting the two, I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that the counter <-> output file mapping stays the same, in which case I think we can attribute this to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:

I think this is run-to-run variation. Might depend on the workload. I ran the same experiment using the occupancy.hip sample workload (MI300X):

original code vs original code:

original code vs ben's modification:

coleramos425 · 2024-07-18T17:47:34Z

LGTM

coleramos425 · 2024-07-18T19:23:21Z

For the record, in a vanilla Omniperf profiling run (e.g. no IP block filtering), this PR reduces the number of required application replays from 24 to 15. In a study of mixbench profiling performance (below), I found this leads to a ~27% improvement in runtime.

Note: This test was run using a production rocprofiler build. We can expect an even larger improvement when this is applied in combination with profiler performance enhancements in a future release.

| Code version | Runtime (sec) | # of req. application replays |
| --- | --- | --- |
| Original code | 32.28 | 24 |
| This PR | 23.67 | 15 |
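
The ~27% figure follows directly from the runtimes in the table. A short check of the arithmetic (the variable names are only for illustration):

```python
orig_runtime, pr_runtime = 32.28, 23.67   # seconds, from the table
orig_replays, pr_replays = 24, 15         # application replays, from the table

# Fractional reduction relative to the original code.
runtime_reduction = (orig_runtime - pr_runtime) / orig_runtime
replay_reduction = (orig_replays - pr_replays) / orig_replays

print(f"runtime reduced by {runtime_reduction:.1%}")   # 26.7%
print(f"replays reduced by {replay_reduction:.1%}")    # 37.5%
```

The runtime drops by less than the replay count, which suggests that part of the profiling time is fixed overhead that does not scale with the number of replays.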

CC: @koomie

…m-rel-6.2 (#422)

* Improve perfmon coalescing

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Interleave TCC channel counters

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Remove duplicate normal counters

Interleave TCC channel counters in output file, e.g. TCC_HIT[0] TCC_ATOMIC[0] ... TCC_HIT[1] TCC_ATOMIC[1]

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Save accumulate counters to SQ_ files

Omniperf analyze expects the accumulate files to be in SQ_*.csv files. Since these files also contain PMC counters (we are trying to fit as many counters into each file as possible to minimize runs), we need to include these SQ_*.csv files in pmc_perf.csv.

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Update to work with rocprof v1

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Remove unused method

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Set correct number of TCC channels for gfx942

Ran into rocprof error:
ROCProfiler: fatal error: input metric'TCC_EA0_RDREQ[16]' not supported on this hardware: gfx942

gfx942 has 16 channels, not 32.

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Fix code formatting

Signed-off-by: benrichard-amd <ben.richard@amd.com>

---------

Signed-off-by: benrichard-amd <ben.richard@amd.com>

benrichard-amd requested review from koomie and coleramos425 as code owners July 15, 2024 23:48

benrichard-amd added 8 commits July 17, 2024 15:34

Improve perfmon coalescing

4c55b67

Signed-off-by: benrichard-amd <ben.richard@amd.com>

Interleave TCC channel counters

68e8a7a

Signed-off-by: benrichard-amd <ben.richard@amd.com>

Remove duplicate normal counters

6df5dee

Interleave TCC channel counters in output file, e.g. TCC_HIT[0] TCC_ATOMIC[0] ... TCC_HIT[1] TCC_ATOMIC[1]

Signed-off-by: benrichard-amd <ben.richard@amd.com>
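
The ordering described in this commit message can be sketched as below: iterate over channels in the outer loop and counters in the inner loop, so that all counters for channel 0 come first, then all counters for channel 1, and so on. The `interleave_channels` helper is hypothetical and is not taken from the PR.

```python
def interleave_channels(counters, num_channels):
    """Expand per-channel counters so that entries are grouped by channel:
    all counters for channel 0, then all counters for channel 1, etc."""
    return [f"{name}[{ch}]"
            for ch in range(num_channels)   # outer loop: channel index
            for name in counters]           # inner loop: counter name


order = interleave_channels(["TCC_HIT", "TCC_ATOMIC"], 2)
print(" ".join(order))
# TCC_HIT[0] TCC_ATOMIC[0] TCC_HIT[1] TCC_ATOMIC[1]
```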

Update to work with rocprof v1

77ced24

Signed-off-by: benrichard-amd <ben.richard@amd.com>

Remove unused method

473f252

Signed-off-by: benrichard-amd <ben.richard@amd.com>

Set correct number of TCC channels for gfx942

eb83d74

Ran into rocprof error:
ROCProfiler: fatal error: input metric'TCC_EA0_RDREQ[16]' not supported on this hardware: gfx942

gfx942 has 16 channels, not 32.

Signed-off-by: benrichard-amd <ben.richard@amd.com>
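
The error above arises because channel indices run from 0 to 15 on a 16-channel device, so index 16 does not exist. A small sketch of a check that would catch such names before they reach rocprof is shown below. The `unsupported_channel_counters` helper is hypothetical. The only channel count taken from this PR is 16 for gfx942.

```python
import re


def unsupported_channel_counters(names, num_channels):
    """Return counter names whose channel index is out of range.

    Names without a [index] suffix are treated as non-channel counters
    and are never flagged.
    """
    bad = []
    for name in names:
        m = re.fullmatch(r"(\w+)\[(\d+)\]", name)
        if m and int(m.group(2)) >= num_channels:
            bad.append(name)
    return bad


GFX942_TCC_CHANNELS = 16  # per the commit message above
requested = ["TCC_HIT[0]", "TCC_EA0_RDREQ[15]", "TCC_EA0_RDREQ[16]"]
print(unsupported_channel_counters(requested, GFX942_TCC_CHANNELS))
# ['TCC_EA0_RDREQ[16]']
```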

Fix code formatting

5eaed48

Signed-off-by: benrichard-amd <ben.richard@amd.com>

benrichard-amd force-pushed the fewer-runs branch from b2d124a to 5eaed48 Compare July 17, 2024 20:35

coleramos425 merged commit 3d4b48d into ROCm:dev Jul 18, 2024
8 of 10 checks passed

benrichard-amd deleted the fewer-runs branch July 18, 2024 19:26
