Performance
The Compadre toolkit uses on-node parallelism to efficiently solve many quadratic programs in parallel, each formulated using Generalized Moving Least Squares (GMLS).
We now use a KokkosKernels implementation of QR with pivoting, which removes any reliance on LAPACK or on CUSOLVER + CUBLAS. The thread-spawning problem inherent in some versions of LAPACK therefore no longer applies.
At the outermost level of parallelism, you can set the number of threads with command-line arguments such as --kokkos-num-threads=16, or with an environment variable, as in OMP_NUM_THREADS=4 ./your_application or KOKKOS_NUM_THREADS=4 ./your_application.
Within Compadre, three levels of hierarchical parallelism are used (team, thread, and vector lanes). For a CPU, the default is to use all parallelism for teams, with 1 thread per team, and 1 vector lane per thread. For GPU, the default is 16 threads per team, with 8 vector lanes per thread, and the number of teams being the concurrency divided by 16*8.
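The default split described above is simple arithmetic, and a small sketch may make it concrete. The struct TeamLayout and the function make_layout below are hypothetical names used only for illustration; they are not part of Compadre or Kokkos. Clamping the team count to at least 1 is also an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical illustration only: not a Compadre or Kokkos type.
struct TeamLayout {
    int teams;                    // number of teams
    int threads_per_team;         // threads in each team
    int vector_lanes_per_thread;  // vector lanes in each thread
};

// Split a device's total concurrency into teams, following the rule
// "teams = concurrency / (threads per team * vector lanes per thread)".
TeamLayout make_layout(int concurrency, int threads_per_team, int vector_lanes) {
    assert(concurrency > 0 && threads_per_team > 0 && vector_lanes > 0);
    int per_team = threads_per_team * vector_lanes;
    int teams = concurrency / per_team;  // integer division, remainder dropped
    if (teams < 1) teams = 1;            // assumption: always keep at least one team
    return TeamLayout{teams, threads_per_team, vector_lanes};
}
```

With the CPU defaults, make_layout(16, 1, 1) gives 16 teams of one thread each. With the GPU defaults and an assumed concurrency of 81920, make_layout(81920, 16, 8) gives 81920 / 128 = 640 teams.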
The thread and vector-lane settings can be modified for performance using environment variables, for example THREADS=16 VECTORLANES=8 ./your_application.
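As a rough sketch of how an application could pick up overrides of this kind, the helper below reads a positive integer from an environment variable and falls back to a default otherwise. The function env_int_or is a hypothetical name, and this is not how Compadre itself parses THREADS or VECTORLANES; only the variable names and the GPU defaults (16 and 8) come from the text above.

```cpp
#include <climits>
#include <cstdlib>

// Hypothetical helper, not Compadre's own parsing code.
// Returns the value of environment variable `name` as an int, or
// `fallback` when the variable is unset, empty, or not a positive integer.
int env_int_or(const char* name, int fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    long value = std::strtol(raw, &end, 10);
    // Reject trailing junk, non-positive values, and values too large for int.
    if (*end != '\0' || value <= 0 || value > INT_MAX) return fallback;
    return static_cast<int>(value);
}
```

A caller could then write env_int_or("THREADS", 16) and env_int_or("VECTORLANES", 8), so that running with no variables set reproduces the GPU defaults.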