nvbench is a great tool for generating profiles for libcudf. I've found that the --profile option with --run-once is a good starting point, but for many operations we need more than one run to generate clean NVTX regions in the profile. For IO operations the first run populates the OS page cache, and for other functions we sometimes see lazy library loading (CUDA_MODULE_LOADING=LAZY) in the first run.
For this reason I've started to use --timeout 0.1 or similar when profiling nvbench benchmarks. This works well for generating equally sized blocks of NVTX regions for further analysis. However, it starts to break down if the benchmark times vary too much, and it is tricky to align NVTX regions to benchmark conditions when the count of NVTX regions per benchmark varies a lot.
I wish there were a way to specify a particular count of benchmark runs. The CLI parameter could be something like --run-fixed 10. This would make it easy to automate profile analysis, and would enable us to index over the NVTX regions and identify the benchmark conditions.
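To illustrate the kind of indexing a fixed run count would enable, here is a minimal Python sketch. It assumes the NVTX regions of interest have already been exported from the profile in chronological order, that each benchmark run produces exactly one such region, and that the benchmark conditions are known in execution order. The function name `group_regions_by_condition`, the condition labels, and the sample durations are all hypothetical, and `--run-fixed` is only the flag proposed above, not an existing nvbench option.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def group_regions_by_condition(
    regions: Sequence[T],
    conditions: Sequence[str],
    runs_per_condition: int,
) -> Dict[str, List[T]]:
    """Slice a flat, chronologically ordered list of NVTX regions into
    equal blocks, one block per benchmark condition.

    Assumes every condition ran exactly `runs_per_condition` times
    (what a hypothetical `--run-fixed N` would guarantee) and that the
    condition labels are unique.
    """
    if runs_per_condition <= 0:
        raise ValueError("runs_per_condition must be positive")

    expected = len(conditions) * runs_per_condition
    if len(regions) != expected:
        # With a timeout-based run count this check typically fails,
        # which is the alignment problem described above.
        raise ValueError(
            f"expected {expected} regions, found {len(regions)}"
        )

    return {
        condition: list(
            regions[i * runs_per_condition:(i + 1) * runs_per_condition]
        )
        for i, condition in enumerate(conditions)
    }


# Hypothetical example: two conditions, three runs each, durations in ms.
durations_ms = [4.1, 3.9, 4.0, 39.8, 40.2, 40.0]
conditions = ["num_rows=1000", "num_rows=10000"]
grouped = group_regions_by_condition(durations_ms, conditions, 3)
```

With a timeout-based run count, the block boundaries would have to be inferred from the profile itself. With a fixed count, they follow directly from the region index.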