Performance

Paul Kuberry edited this page Dec 5, 2024 · 4 revisions

The Compadre toolkit takes advantage of on-node parallelism to efficiently solve many independent quadratic programs in parallel, formulated using Generalized Moving Least Squares (GMLS).

We now use a KokkosKernels implementation of QR with pivoting, which removes any reliance on LAPACK or on cuSOLVER and cuBLAS. The thread-spawning problem inherent in some versions of LAPACK therefore no longer applies.

Fine-grain control:

At the outermost level of parallelism, you can set the number of threads with a command-line argument such as --kokkos-num-threads=16, or with an environment variable, for example OMP_NUM_THREADS=4 ./your_application or KOKKOS_NUM_THREADS=4 ./your_application.

Within Compadre, three levels of hierarchical parallelism are used (teams, threads, and vector lanes). On a CPU, the default is to use all parallelism for teams, with 1 thread per team and 1 vector lane per thread. On a GPU, the default is 16 threads per team and 8 vector lanes per thread, with the number of teams equal to the concurrency divided by 16*8 = 128.
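The default layout described above comes down to a small piece of arithmetic, which the following sketch spells out. It is only an illustration: `TeamLayout` and `default_layout` are hypothetical names that do not exist in Compadre or Kokkos, the concurrency value is passed in by hand rather than queried from an execution space, and the floor of one team for very small concurrency values is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only; these names are not part of Compadre or Kokkos.
struct TeamLayout {
    std::size_t teams;                    // number of teams (outermost level)
    std::size_t threads_per_team;         // threads within each team
    std::size_t vector_lanes_per_thread;  // vector lanes within each thread
};

// Computes the default layout described in the text:
//   CPU: 1 thread per team, 1 vector lane per thread, teams = concurrency
//   GPU: 16 threads per team, 8 vector lanes per thread,
//        teams = concurrency / (16 * 8)
TeamLayout default_layout(std::size_t concurrency, bool is_gpu) {
    assert(concurrency > 0);
    const std::size_t threads = is_gpu ? 16 : 1;
    const std::size_t lanes = is_gpu ? 8 : 1;
    std::size_t teams = concurrency / (threads * lanes);
    if (teams == 0) {
        teams = 1;  // assumption: always keep at least one team
    }
    return TeamLayout{teams, threads, lanes};
}
```

For example, under these assumptions `default_layout(16, false)` gives 16 teams of 1 thread and 1 vector lane each, while `default_layout(2048, true)` gives 2048 / 128 = 16 teams of 16 threads and 8 vector lanes each.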

These settings can be tuned for performance with the environment variables THREADS and VECTORLANES, for example THREADS=16 VECTORLANES=8 ./your_application.