2.6.0 release notes
A list of new features in Caliper v2.6.0.
Profile CUDA GPU activities such as memory copies and kernels with the cuda-activity-profile and cuda-activity-report configs. This example output for cuda-activity-report shows GPU time spent in various CUDA kernels:
$ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
Path Kernel Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop
|- 17.068956 17.069917 0.239392 0.240725 1.402501
|- device_unpack_top_buffe~~le*, double*, int, int) 0.091051 0.092734
|- device_tea_leaf_ppcg_so~~ const*, double const*) 5.409844 5.419096
|- device_tea_leaf_ppcg_so~~t*, double const*, int) 5.316101 5.320777
|- device_pack_right_buffe~~le*, double*, int, int) 0.112455 0.113198
|- device_pack_top_buffer(~~le*, double*, int, int) 0.092634 0.092820
(..)
|- device_pack_bottom_buff~~le*, double*, int, int) 0.098929 0.099095
summary
|- 0.000881 0.000964 0.000010 0.000011 1.179024
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084
cudaMemcpy 0.000437 0.000457 0.000010 0.000011 2.376874
cudaLaunchKernel
|- 0.000324 0.000392
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084
While cuda-activity-report prints human-readable data, the cuda-activity-profile config produces a JSON or .cali file for processing with Hatchet or cali-query. Learn more about CUDA profiling in the CUDA profiling how-to.
Caliper v2.6.0 introduces basic support for profiling CPU-side OpenMP constructs, such as parallel regions and workshare constructs, with the OpenMP tools interface (OMPT). Note that only OpenMP 5.1 compliant compilers like clang v9+ support OMPT. When OMPT support is available, Caliper provides the openmp-report config. Here is example output showing the time spent in OpenMP workshare regions and barriers:
$ CALI_CONFIG=openmp-report ./caliper-openmp-example
Path   #Threads Time (thread) Time (total) Work %    Barrier % Time (work) Time (barrier)
main                 0.005122     0.027660 85.969388 14.030612
  work        4      0.005110     0.027572 85.969388 14.030612    0.011121       0.001815
Learn more about OpenMP profiling in the OpenMP profiling how-to.
There is a new API to write ConfigManager reports into user-defined C++ streams for MPI programs:
auto res = cali::make_collective_output_channel("runtime-report(profile.mpi)");
auto channel = res.first;
channel->start();
//...
channel->collective_flush(std::cout, MPI_COMM_WORLD); // any std::ostream works here
Find a more detailed example program here.
The new main_thread_only ConfigManager option shows profiling data only from the program's main thread. Consider this program:
#include <caliper/cali.h>
#include <cstdlib>

int main()
{
    CALI_MARK_BEGIN("main");
#pragma omp parallel for
    for (int i = 0; i < 42; ++i) {
        CALI_MARK_BEGIN("parallel");
        /* ... */
        CALI_MARK_END("parallel");
    }
    CALI_MARK_END("main");
    return EXIT_SUCCESS;
}
Caliper measures the time in the "parallel" region on each thread. Meanwhile, the "main" region is only visible on the main thread. Therefore, you'll find an "orphaned" entry with the time inside the "parallel" region from the OpenMP child threads in the report output:
$ CALI_CONFIG=runtime-report ./caliper-threads
Path       Min time/rank Max time/rank Avg time/rank Time %
main            0.002054      0.002054      0.002054  8.955744
  parallel      0.003233      0.003233      0.003233 14.096359
parallel        0.009773      0.009773      0.009773 42.611729
With the main_thread_only
option, Caliper only reports data from the main thread:
$ CALI_CONFIG=runtime-report,main_thread_only ./caliper-threads
Path       Min time/rank Max time/rank Avg time/rank Time %
main            0.001465      0.001465      0.001465 11.204589
  parallel      0.003339      0.003339      0.003339 25.537285
The region.count metric counts the number of times a Caliper region was called:
$ ./examples/apps/cxx-example -P runtime-report,aggregate_across_ranks=false,region.count
Path       Time (E) Time (I) Time % (E) Time % (I) Calls
main       0.000157 0.000993   7.822621  49.476831 1.000000
  mainloop 0.000109 0.000813   5.430992  40.508221 5.000000
    foo    0.000704 0.000704  35.077230  35.077230 4.000000
  init     0.000023 0.000023   1.145989   1.145989 1.000000
Note that counts for Caliper regions which are hidden in the report output will be added to the surrounding region. The example above has hidden loop iteration annotations, which are added to the count of the "mainloop" region.
The roctx service forwards Caliper regions to AMD rocprofiler as rocTX annotations:
$ CALI_SERVICES_ENABLE=roctx rocprof (...) ./app
You can load custom ConfigManager configuration recipes or options from JSON files. This example defines a new ConfigManager option "tot_ins" that adds the PAPI_TOT_INS PAPI counter:
{
  "options": [
    { "name"        : "tot_ins",
      "description" : "Instructions",
      "category"    : "metric",
      "services"    : [ "papi" ],
      "config"      : { "CALI_PAPI_COUNTERS": "PAPI_TOT_INS" },
      "query"       : [
        { "level": "local", "select": [ { "expr": "sum(sum#papi.PAPI_TOT_INS)", "as": "Instr." } ] },
        { "level": "cross", "select": [
            { "expr": "avg(sum#sum#papi.PAPI_TOT_INS)", "as": "Instr. (avg)" },
            { "expr": "sum(sum#sum#papi.PAPI_TOT_INS)", "as": "Instr. (total)" }
          ]
        }
      ]
    }
  ]
}
We can load this file using the load command in a ConfigManager config string and use the option in compatible configs, like runtime-report:
$ ./examples/apps/cxx-example -P "load(tot_ins.json),runtime-report,tot_ins"
Path       Min time/rank Max time/rank Avg time/rank Time %     Instr. (avg) Instr. (total)
main            0.000115      0.000115      0.000115  6.068602 207335.000000  207335.000000
  mainloop      0.000102      0.000102      0.000102  5.382586 222597.000000  222597.000000
    foo         0.000664      0.000664      0.000664 35.039578  76007.000000   76007.000000
  init          0.000020      0.000020      0.000020  1.055409  32511.000000   32511.000000
You can also write entirely new config recipes. This is an advanced feature; reach out via the GitHub discussion page if you want to learn more.
The Caliper build system should find CUDA components like CUPTI and NVTX automatically, or with only CUDA_TOOLKIT_ROOT_DIR set.