I don't fully understand how to tune parallelization parameters with different executors (for example, changing the number of processors for the OMP executor in order to study the scalability of an algorithm I wrote, or the number of blocks used for the CUDA executor). Furthermore, I wonder whether there is an explanation of how certain operations, for example the matrix-vector product, are parallelized. Thank you in advance.
-
You can control the number of processors used for OpenMP parallelization either with the `OMP_NUM_THREADS` environment variable or the `omp_set_num_threads` function inside your code. We rely on the OpenMP runtime or the user to set a sensible number of cores. In most situations, that just means using as much parallelism as possible: with the exception of some cases where synchronization overheads or NUMA effects have a significant impact, using more threads also gives you better performance.

In CUDA, we do this tuning ourselves, generally using a thread-to-row or (sub)warp-to-row mapping for the kernels, with heuristic oversubscription parameters to ensure all SMs are busy, or calling cuBLAS and…
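
For the scalability study mentioned in the question, a minimal sketch of sweeping thread counts with `omp_set_num_threads` might look like the following. The thread counts and the loop body are assumptions; the loop is just a stand-in for your own algorithm, not library code:

```cpp
// Minimal scalability sweep sketch; compile with: g++ -O2 -fopenmp sweep.cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <omp.h>

int main()
{
    const std::size_t n = 1 << 22;
    std::vector<double> x(n, 1.0), y(n, 0.0);

    for (int num_threads : {1, 2, 4, 8}) {
        omp_set_num_threads(num_threads);  // overrides OMP_NUM_THREADS
        const auto start = std::chrono::steady_clock::now();
#pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            y[i] += 2.0 * x[i];  // hypothetical stand-in for your algorithm
        }
        const auto stop = std::chrono::steady_clock::now();
        std::cout << num_threads << " thread(s): "
                  << std::chrono::duration<double>(stop - start).count()
                  << " s\n";
    }
}
```

Equivalently, you can leave the code untouched and run the binary several times with `OMP_NUM_THREADS=1`, `OMP_NUM_THREADS=2`, and so on.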
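
To make the thread-to-row mapping concrete, here is a rough, self-contained sketch of a CSR matrix-vector product where each CUDA thread accumulates one output row. This only illustrates the mapping idea, not the library's actual kernel; all names, the test matrix, and the launch parameters are assumptions:

```cpp
// Illustrative thread-to-row CSR SpMV sketch; compile with: nvcc spmv.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void csr_spmv(int num_rows, const int* row_ptrs,
                         const int* col_idxs, const double* vals,
                         const double* x, double* y)
{
    // thread-to-row mapping: one thread computes one output row
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            sum += vals[k] * x[col_idxs[k]];
        }
        y[row] = sum;
    }
}

int main()
{
    // tiny example: [[1 2], [0 3]] * [1, 1]^T = [3, 3]^T
    const int num_rows = 2;
    const int h_row_ptrs[] = {0, 2, 3};
    const int h_col_idxs[] = {0, 1, 1};
    const double h_vals[] = {1.0, 2.0, 3.0};
    const double h_x[] = {1.0, 1.0};
    double h_y[2];

    int *row_ptrs, *col_idxs;
    double *vals, *x, *y;
    cudaMalloc(&row_ptrs, sizeof(h_row_ptrs));
    cudaMalloc(&col_idxs, sizeof(h_col_idxs));
    cudaMalloc(&vals, sizeof(h_vals));
    cudaMalloc(&x, sizeof(h_x));
    cudaMalloc(&y, sizeof(h_y));
    cudaMemcpy(row_ptrs, h_row_ptrs, sizeof(h_row_ptrs), cudaMemcpyHostToDevice);
    cudaMemcpy(col_idxs, h_col_idxs, sizeof(h_col_idxs), cudaMemcpyHostToDevice);
    cudaMemcpy(vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    // block size fixed, grid size derived from the row count
    const int block_size = 256;
    const int num_blocks = (num_rows + block_size - 1) / block_size;
    csr_spmv<<<num_blocks, block_size>>>(num_rows, row_ptrs, col_idxs,
                                         vals, x, y);

    cudaMemcpy(h_y, y, sizeof(h_y), cudaMemcpyDeviceToHost);
    std::printf("y = [%g, %g]\n", h_y[0], h_y[1]);

    cudaFree(row_ptrs); cudaFree(col_idxs);
    cudaFree(vals); cudaFree(x); cudaFree(y);
    return 0;
}
```

A (sub)warp-to-row mapping would instead assign a group of threads to each row and reduce their partial sums, which tends to pay off for rows with many nonzeros.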