-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathlinear_solvers.tex
195 lines (147 loc) · 20.9 KB
/
linear_solvers.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
% !TEX root = main.tex
%\todo{Describe new linear solver packages, ShyLU, FROSch, new packages similar to old packages, new capabilities in these packages (two level DD), GPU supported features}
%
%\todo{@Siva/Alexander: do this!}
Trilinos offers many linear solver capabilities: dense and sparse direct solvers, iterative solvers, shared-memory preconditioners local to a compute node, and scalable distributed memory domain decomposition and multigrid methods. Furthermore, interfaces to several third-party direct solvers are provided. All of the native solver capabilities are built on top of Kokkos and are GPU capable to varying degrees; any exceptions are noted in the detailed descriptions below.
\subsection{Iterative Linear Solvers and Krylov Methods: Belos}
Belos~\cite{Bavier2012a} provides next-generation iterative linear solvers and a powerful developer framework for solving linear systems and least-squares problems. This framework provides several abstractions that facilitate code reuse and extensibility.
First, Belos decouples the algorithms from the implementation of the underlying linear algebra object using traits mechanisms. While concrete linear algebra adapters are provided for Tpetra and Thyra, users can also implement their own interfaces to leverage any existing investment in their description of matrices and vectors. Second, there are abstractions to orthogonalization to ease the integration of application or architecture-specific orthogonalization methods. Implementations of iterated classical Gram-Schmidt, DGKS-corrected classical Gram-Schmidt, and iterated modified Gram-Schmidt are provided. Finally, powerful solver managers encapsulate the strategy for solving a linear system or least-squares problem using abstract interfaces to iteration kernels. As a result, the algorithms developed using Belos abstractions can be relatively agnostic to data layout in memory or distributed over processors and parallel matrix/vector operations.
Belos supersedes the AztecOO package~\cite{Heroux2004a} providing solver managers for single-vector iterative solvers, but also extensions of these methods to block iterative methods for solving linear systems with multiple right-hand sides. Single-vector solver managers include conjugate gradient (CG), minimum residual (MINRES), generalized minimal residual (GMRES), stabilized biconjugate gradient (BiCGStab), and transpose-free QMR (TFQMR) as well as ``seed'' solvers (hybrid GMRES, PCPG), subspace recycling solvers (GCRO-DR, RCG), and least-squares solvers (LSQR). There are versions of CG, GMRES, and GCRO-DR that construct a block Krylov subspace to solve for one or more right-hand sides. There are also ``pseudo-block'' versions of CG, GMRES, and TFQMR that apply the single-vector iteration simultaneously on multiple right-hand sides, where matrix and preconditioner applications are aggregated to achieve better performance. Any single-vector iteration can be used to solve multiple right-hand sides as well but, if a block or pseudo-block version is not used, the solver will sequentially solve the right-hand sides.
In recent years, application or architecture-specific variants of the classic iterative methods have been integrated into Belos. Some of these variations are so minor that they are enabled through input parameters on the classic method. For instance, flexible GMRES~\cite{Saad1993a} is an option for the block GMRES solver, and pipelined CG~\cite{GHYSELS2014224} is an option for both the block CG and pseudo-block CG solver. Additionally, there are stand-alone solvers for a pseudo-block stochastic CG method~\cite{Parker2012SamplingGD} and a fixed-point iteration method.
\subsection{One-Level Domain Decomposition and Basic Iterative Methods: Ifpack2}
Trilinos provides domain decomposition approaches in two different
packages: Ifpack2 and ShyLU (specifically the ShyLU\_DD
subpackage). Ifpack2 implements overlapping additive Schwarz
approaches with several options for the local subdomain solves. The
local subdomain solvers may be either CPU-only versions of incomplete
factorization preconditioners implemented in Ifpack2 itself, such as
ILU(k) and ILUt (thresholded ILU), or architecture portable algorithms
for incomplete factorizations and triangular solvers implemented in
Kokkos Kernels. It is also possible to use direct solvers as subdomain
solvers, and ShyLU also supports shared-memory inexact
incomplete factorization preconditioners.
%
Such one-level preconditioners can be used as smoothers within multigrid
methods, for solving ``simpler'' problems where the setup cost of a
more robust multilevel methods is prohibitive when compared to the
reduction in the number of iterations, or when the underlying problem
is simply not amenable to multilevel methods.
Ifpack2 also supplies classic iterative methods based on matrix-splitting techniques, such as Jacobi iteration, Gauss-Seidel, and an MPI-oriented hybrid of Jacobi and Gauss-Seidel (e.g., Jacobi between ranks and Gauss-Seidel on them). Ifpack2 also provides preconditioners
based on Chebyshev iterations. The aforementioned preconditioners are available both in point and block
forms and can operate on CSR and BSR matrices. In the block case,
line relaxation is also supported, while in the point case, techniques
like Vanka relaxation \cite{Vanka1986} are possible. Auxiliary-space
smoothing for $H(curl)$ and $H(div)$ discretizations of the style of
Hiptmair \cite{Hiptmair1997} are also supported.
Local kernels are either implemented in Ifpack2 itself
but can also be called from Kokkos Kernels for portable shared-memory algorithmic variants of Gauss-Seidel and Jacobi iterations.
\subsection{Multilevel Domain Decomposition Methods: FROSch}
\label{ssec:frosch}
FROSch (Fast and Robust Overlapping Schwarz) is a framework for the construction of multilevel Schwarz domain decomposition preconditioners. Besides parallel scalability, FROSch emphasizes applicability and robustness across a wide range of challenging problems while supporting an algebraic construction. Specifically, most preconditioners can be built using only the fully assembled system matrix, though some variants can take advantage of additional geometric inputs. The algebraic construction is enabled by the creation of an overlapping domain decomposition on the first level based on the sparsity pattern of the system matrix, similar to Ifpack2, along with the incorporation of extension-based coarse spaces, such as in the classical two-level generalized Dryja--Smith--Widlund (GDSW) preconditioner~\cite{dohrmann_domain_2008} and related variants.
While the initial version of FROSch~\cite{heinlein_parallel_2016} was based on the outdated Epetra linear algebra framework, the current implementation~\cite{heinlein_frosch_2020} leverages Xpetra. Originally designed as a lightweight wrapper around Epetra and Tpetra, the Xpetra package facilitated over the years compatibility with both the Epetra and Tpetra stacks. Now that the Epetra stack is phased out, it provides access to only the Tpetra stack. Algorithmic variants of Schwarz methods implemented in FROSch include:
\begin{itemize}
\item \emph{Extension-based coarse spaces based on a partition of unity on the interface}, such as classical GDSW, reduced dimension GDSW (RGDSW) coarse spaces, and multiscale finite element method (MsFEM) coarse spaces; cf.~\cite{heinlein_parallel_2016,heinlein_improving_2018};
\item \emph{Monolithic Schwarz preconditioners} for block systems; cf.~\cite{heinlein_monolithic_2019}.
\item \emph{Multilevel Schwarz preconditioners}, which are obtained from two-level Schwarz preconditioners by recursively applying Schwarz preconditioners as an inexact solver for the coarse problems; cf.~\cite{heinlein_parallel_2022}.
\end{itemize}
FROSch has been applied to various challenging application problems, including scalar elliptic and elasticity problems~\cite{heinlein_parallel_2016}, possibly with heterogeneities~\cite{alves2024computationalstudyalgebraiccoarse}, computational fluid dynamics problems~\cite{heinlein_monolithic_2019}, time-harmonic Maxwell's and fluid-structure interaction problems~\cite{heinlein2024couplingdealiifroschsustainable}, pharmaco-mechanical interactions in arterial walls~\cite{balzani_computational_nodate}, and coupled multiphysics problems for land ice simulations~\cite{heinlein_frosch_2022}; the latter three have been solved using monolithic preconditioning techniques. To extend robustness for heterogeneous model problems, an implementation of spectral coarse spaces~\cite{heinlein_adaptive_2019} is currently under development. FROSch preconditioners have scaled to more than \num{200000} cores on the Theta Cray XC40 supercomputer at the Argonne Leadership Computing Facility (ALCF); cf.~\cite{heinlein_parallel_2022}.
In its current implementation, FROSch assumes a one-to-one correspondence of subdomains and MPI ranks. Using an interface to the other solver packages in Trilinos, inexact subdomain solvers can be employed on subdomains. An extension to multiple subdomains per MPI rank is currently being implemented. Using Kokkos and Kokkos Kernels, FROSch has recently also been ported to GPUs~\cite{yamazaki_experimental_2023} with performance gains for the triangular solve or inexact solves with ILU on GPUs.
A demo/tutorial for FROSch can be found at the GitHub repository~\cite{frosch_demo}.
\subsection{Multigrid Methods: MueLu}
MueLu is a flexible and scalable high-performance multigrid solver library.
It provides a variety of multigrid algorithms for problems ranging from Poisson-like operators, over elasticity, convection-diffusion, Navier-Stokes, and Maxwell's equations,
all the way to multigrid methods for coupled multiphysics systems.
Besides its strong focus on aggregation-based algebraic multigrid (AMG) methods,
MueLu comes with specialized capabilities for (semi-)structured grids to perform semi-coarsening along grid lines
while forming the coarse operator via a Galerkin product (in contrast to classical geometric multigrid methods).
MueLu is extensible and allows for the research and development of new multigrid preconditioning methods.
Its weak and strong scalability up to \num{131000} cores of a Cray XC40 and one million cores of a Blue Gene/Q system, even for vector-valued partial differential equations (PDEs) on unstructured meshes, have been shown in~\cite{Lin2017a,Thomas2019a}.
MueLu provides several approaches to constructing and solving the multilevel problem:
\begin{itemize}
\item \emph{Algebraic smoothed aggregation approach}~\cite{Vanek1996a}:
The matrix graph is colored to create aggregates (groups) of nodes.
These aggregates define a tentative projection operator.
A final projection operator is created by applying a smoother to the tentative operator.
There are a variety of deterministic and non-deterministic coloring algorithms implemented directly
in MueLu or in Kokkos Kernels. For a full description, see~\cite{BergerVergiat2023a}.
\item \emph{Algebraic multigrid for Maxwell's equations}:
MueLu implements specialized solvers for the solution of curl-curl problems~\cite{BochevHuEtAl2008_AlgebraicMultigridApproachBased}.
Scaling results on Haswell, KNL, ARM and NVIDIA V100 GPUs at full machine scale can be found in~\cite{BettencourtBrownEtAl2021_EmpirePic}.
\item \emph{Multigrid for multiphysics problems}:
MueLu implements a tool box to compile multi-level block preconditioners for block matrices arising from coupled multiphysics problems.
Applications range from Navier-Stokes equations
over surface-coupling (as in fluid/structure interaction or contact mechanics~\cite{Wiesner2021a})
to volume-coupled problems (e.g. in magneto-hydro dynamics~\cite{Ohm2022a}).
\item \emph{Semi-coarsening algebraic multigrid approach}~\cite{Tuminaro2016a}:
Specialized aggregation procedures for three-dimensional meshes generated by extrusion of a two-dimensional unstructured grid
allow to first coarsen in the direction of extrusion to reduce the system to a two-dimensional representation and then perform classical aggregation-based AMG
for the remaining coarsenings.
\item \emph{AMG for (semi-)structured grids}:
MueLu has a structured aggregation capability in which the user may specify the coarsening rate used to compute interpolation operators.
The coarse operators are then formed via a Galerkin product to avoid remeshing on the coarse levels.
This work has been extended to semi-structured grids to leverage structured-grid computational performance also for globally unstructured grids~\cite{Mayr2022a}.
\item \emph{Geometric multigrid / matrix-free capabilities}:
MueLu treats many of the objects in its hierarchy as operators instead of matrices when possible.
This enables one to use MueLu in a matrix-free fashion provided the user can supply their own grid transfers and coarse operators.
% through factories such as \texttt{MatrixFreeTentativePFactory},
% which only requires aggregates in order to perform a grid transfer using a tentative prolongator.
% Other grid transfers such as structured grid transfers discussed above may also be performed in a matrix-free fashion,
% and simple smoothers such as Jacobi and Chebyshev may also be performed on matrix-free operators.
\end{itemize}
MueLu has a required dependency on Kokkos,
and the main code paths in MueLu (e.g. smoothed aggregation, semi-coarsening, $p$-coarsening, structured aggregation)
utilize Kokkos and have been shown to scale on device \cite{BettencourtBrownEtAl2021_EmpirePic}.
Some of the earlier algorithms such as energy minimization~\cite{Sala2008a} and the variable degree-of-freedom Laplacian have not been adapted to fully utilize Kokkos yet,
however, they are part of ongoing efforts to use Kokkos throughout MueLu completely and to merge serial and device-capable algorithms.
There are a variety of resources to learn more about MueLu.
An overview is given on the MueLu website\footnote{\url{https://trilinos.github.io/muelu.html}}.
The MueLu User's Guide~\cite{BergerVergiat2023a} summarizes installation instructions and a reference to most of MueLu's configuration parameters.
The MueLu Tutorial~\cite{Mayr2023b} introduces beginners and experts to various topics in MueLu and shows how to solve or precondition different linear systems using MueLu.
Details on the compatibility of MueLu and its predecessor ML~\cite{Heroux2005a,Gee2006a} can be found in the MueLu User's Guide~\cite{BergerVergiat2023a}.
Besides its C++ API, MueLu offers a MATLAB interface, MueMex, to provide access to MueLu's aggregation and solver routines from MATLAB.
MueMex allows users to setup and solve arbitrarily many problems with either MueLu as a preconditioner, Belos as a solver and Epetra or Tpetra for data structures.
While MueLu requires Teuchos, Tpetra, Xpetra, Kokkos, and Kokkos Kernels, it has a strong dependence
on the Amesos2, Ifpack2, and Zoltan2 libraries.
MueLu also supports interfaces to abstraction layer packages such as Stratimikos and Thyra through the MueLu adapters library.
These interfaces are required to use MueLu with the Teko block preconditioning package.
For more details on using MueLu with Teko, see the MueLu examples directory.
\subsection{Direct Linear Solvers: ShyLU and Amesos2}
ShyLU~\cite{ShyLUCore2014} provides an implementation of three solvers: Basker, Tacho, and a distributed-memory solver based on the Schur complement method.
Basker~\cite{Basker2017} is a sparse direct solver based on LU factorization for the problems that have the block triangular form (BTF) typically seen in circuit simulation applications. Basker uses these structures to factor and solve the diagonal blocks in parallel. The larger diagonal blocks can themselves be factored in parallel by discovering the parallelism available using a nested-dissection reordering. Basker focuses on exploiting thread-parallelism on the multi-core CPU architectures. %However, we have still implemented the solver using Kokkos.
Amesos2 also has a templated implementation of a sequential KLU solver called KLU2, which also exploits the BTF structure.
Tacho is a sparse direct solver that exploits supernodal block structure commonly found in sparse direct factorizations of matrices from mechanics applications. Tacho exploits this supernodal structure for both factorization and triangular-solve phases. It is based on Kokkos. Originally, Tacho implemented a task-parallel Cholesky factorization of sparse symmetric positive definite (SPD) matrices~\cite{Tacho2018}. However, to improve its portability, it has been extended to compute the sparse factorization based on level-set scheduling. Moreover, its functionality has been extended to the computation of an LDLt factorization of symmetric indefinite matrices, as well as the computation of an LU factorization of general matrices with a symmetric sparsity structure.
In addition to their stand-alone use, the aforementioned node-level solvers may be used as the local solvers for domain decomposition preconditioners (Ifpack2 or FROSch) or as the coarse solvers for multilevel preconditioners (MueLu or FROSch).
ShyLU also implements a distributed-memory linear solver based on the Schur complement method~\cite{ShyLUCore2014}. This is a hybrid direct and iterative solver where each subdomain problem is solved using a direct solver in parallel while the Schur complement is solved using an iterative approach. The preconditioner for the Schur complement solver is computed using a probing approach or using a threshold-based dropping strategy. This solver was developed to address the requirements of circuit simulation applications. The solver is also hybrid in the parallel computing sense as it uses MPI+threads. This solver algorithm is not focused on GPU architectures. As a result, we have not implemented this solver using Kokkos. % This solver has been shown to be useful for circuit simulation applications.
Amesos2~\cite{Bavier2012a} provides a uniform interface to direct linear solvers.
These include third-party direct solvers such as CHOLMOD, MUMPS, Pardiso\_MKL, SuperLU, SuperLU\_MT, SuperLU\_Dist, and STRUMPACK.
Amesos2 also provides an interface to the on-node sparse direct solvers implemented in ShyLU.
\subsection{Physics Block Operators and Preconditioners: Teko}
\label{sec:teko}
The Teko library~\cite{Cyr2016a} provides interfaces for operators and preconditioners that are constructed from large physics-based sub-blocks.
The sub-blocks are Thyra operators which are themselves often implemented using Tpetra matrices.
Teko provides generic block preconditioning strategies such as $2\times2$ Block-LU factorizations and the Jacobi and Gauss-Seidel strategies for any number of blocks,
as well as commonly used approximate inverse strategies for the Navier-Stokes equation
such as SIMPLEC, LSC and PCD~\cite{CyrShadidEtAl2012_StabilizationScalableBlockPreconditioning}.
The Block-LU class offers the possibility to implement user-specific inverse approximation strategies,
for example using a sparse approximate inverse to construct the Schur complement~\cite{Firmbach2024a}.
More complicated multilevel hierarchies of block solvers can be generated via \code{Teuchos::ParameterList} objects.
Block preconditioners for first-order formulations of Maxwell's equations and Darcy flow are implemented in the Panzer package.
Teko's solvers can be registered in the Stratimikos interface for usage entirely driven by \code{Teuchos::ParameterList} objects.
\subsection{Eigensolvers: Anasazi}
Anasazi~\cite{Baker2009a} is an extensible and interoperable framework for large-scale eigenvalue algorithms.
This framework provides a generic interface to a collection of algorithms that are built upon abstract interfaces
that facilitate code reuse and extensibility. Similar to Belos, Anasazi decouples the algorithms from the
implementation of the underlying linear algebra objects using traits mechanisms. Concrete linear algebra adapters
are provided for Tpetra and Thyra, while users can also implement their own interfaces to leverage any existing
investment in their description of matrices and vectors. Any libraries that understand Tpetra and Thyra matrices
and vectors, like Belos and Ifpack2, may also be used in conjunction with Anasazi. The suite of eigensolvers provided
by Anasazi includes locally optimal block preconditioned conjugate gradient (LOBPCG), block Davidson, Riemannian Trust-Region
(RTR), and block Krylov-Schur. Recently, there has been a family of trace minimization (TraceMin) methods and a
generalized Davidson method added to the suite of eigensolvers in Anasazi.
\subsection{Unified Solver Interface: Stratimikos}
The Stratimikos package offers a cohesive interface to various linear solvers and preconditioners within Trilinos, including Amesos2, Belos, FROSch, Ifpack2, MueLu, and Teko.
It requires that the matrix, right-hand side, and the solution vectors all support the Thyra interface.
Thyra also provides wrappers for Tpetra linear algebra.
Users can specify solver and preconditioner parameters via a \code{Teuchos::ParameterList}
which can be easily populated from an XML file.
Moreover, the package allows for the easy addition of new solver and preconditioner factories via adapters, with examples available in the \texttt{packages/stratimikos/adapters} directory.