I don't fully understand how to tune parallelization parameters with different executors (for example, changing the number of processors for the OMP executor in order to study the scalability of an algorithm I wrote, or the number of blocks used for the CUDA executor). Furthermore, I wonder whether there is an explanation of how certain operations, for example the matrix-vector product, are parallelized. Thank you in advance.
-
You can control the number of processors used for OpenMP parallelization either with the `OMP_NUM_THREADS` environment variable or the `omp_set_num_threads` function inside your code. We rely on the OpenMP runtime or the user to set a sensible number of cores. In most situations, that just means using as much parallelism as possible: with the exception of some cases where synchronization overheads or NUMA effects have a significant impact, using more threads also gives you better performance.

In CUDA, we do this tuning ourselves, generally using a thread-to-row or (sub)warp-to-row mapping for the kernels, with heuristic oversubscription parameters to ensure all SMs are busy, or calling cuBLAS and…
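
For the scalability study mentioned in the question, a minimal sketch of sweeping thread counts with `omp_set_num_threads` might look like the following. The thread counts and the loop body are assumptions; the loop is just a stand-in for your own algorithm, not library code:

```cpp
// Minimal scalability sweep sketch; compile with: g++ -O2 -fopenmp sweep.cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <omp.h>

int main()
{
    const std::size_t n = 1 << 22;
    std::vector<double> x(n, 1.0), y(n, 0.0);

    for (int num_threads : {1, 2, 4, 8}) {
        omp_set_num_threads(num_threads);  // overrides OMP_NUM_THREADS
        const auto start = std::chrono::steady_clock::now();
#pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            y[i] += 2.0 * x[i];  // hypothetical stand-in for your algorithm
        }
        const auto stop = std::chrono::steady_clock::now();
        std::cout << num_threads << " thread(s): "
                  << std::chrono::duration<double>(stop - start).count()
                  << " s\n";
    }
}
```

Equivalently, you can leave the code untouched and run the binary several times with `OMP_NUM_THREADS=1`, `OMP_NUM_THREADS=2`, and so on.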
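
To make the thread-to-row mapping concrete, here is a rough, self-contained sketch of a CSR matrix-vector product where each CUDA thread accumulates one output row. This only illustrates the mapping idea, not the library's actual kernel; all names, the test matrix, and the launch parameters are assumptions:

```cpp
// Illustrative thread-to-row CSR SpMV sketch; compile with: nvcc spmv.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void csr_spmv(int num_rows, const int* row_ptrs,
                         const int* col_idxs, const double* vals,
                         const double* x, double* y)
{
    // thread-to-row mapping: one thread computes one output row
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int k = row_ptrs[row]; k < row_ptrs[row + 1]; ++k) {
            sum += vals[k] * x[col_idxs[k]];
        }
        y[row] = sum;
    }
}

int main()
{
    // tiny example: [[1 2], [0 3]] * [1, 1]^T = [3, 3]^T
    const int num_rows = 2;
    const int h_row_ptrs[] = {0, 2, 3};
    const int h_col_idxs[] = {0, 1, 1};
    const double h_vals[] = {1.0, 2.0, 3.0};
    const double h_x[] = {1.0, 1.0};
    double h_y[2];

    int *row_ptrs, *col_idxs;
    double *vals, *x, *y;
    cudaMalloc(&row_ptrs, sizeof(h_row_ptrs));
    cudaMalloc(&col_idxs, sizeof(h_col_idxs));
    cudaMalloc(&vals, sizeof(h_vals));
    cudaMalloc(&x, sizeof(h_x));
    cudaMalloc(&y, sizeof(h_y));
    cudaMemcpy(row_ptrs, h_row_ptrs, sizeof(h_row_ptrs), cudaMemcpyHostToDevice);
    cudaMemcpy(col_idxs, h_col_idxs, sizeof(h_col_idxs), cudaMemcpyHostToDevice);
    cudaMemcpy(vals, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
    cudaMemcpy(x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    // block size fixed, grid size derived from the row count
    const int block_size = 256;
    const int num_blocks = (num_rows + block_size - 1) / block_size;
    csr_spmv<<<num_blocks, block_size>>>(num_rows, row_ptrs, col_idxs,
                                         vals, x, y);

    cudaMemcpy(h_y, y, sizeof(h_y), cudaMemcpyDeviceToHost);
    std::printf("y = [%g, %g]\n", h_y[0], h_y[1]);

    cudaFree(row_ptrs); cudaFree(col_idxs);
    cudaFree(vals); cudaFree(x); cudaFree(y);
    return 0;
}
```

A (sub)warp-to-row mapping would instead assign a group of threads to each row and reduce their partial sums, which tends to pay off for rows with many nonzeros.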