diff --git a/LICENSE.md b/LICENSE.md index 7a44e616..e41c5d88 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,6 +1,6 @@ MIT License -Copyright (C) 2018-2023 Advanced Micro Devices, Inc. All rights reserved. +Copyright (C) 2024 Advanced Micro Devices, Inc. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/docs/api.rst b/docs/api/api.rst similarity index 94% rename from docs/api.rst rename to docs/api/api.rst index 2072e390..28d31a33 100644 --- a/docs/api.rst +++ b/docs/api/api.rst @@ -1,10 +1,14 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + .. _api: -### -API -### +############# +API library +############# -This section provides a detailed list of the library API +This document provides the detailed API list. Host Utility Functions ====================== diff --git a/docs/api/backend.rst b/docs/api/backend.rst new file mode 100644 index 00000000..4a1cfd58 --- /dev/null +++ b/docs/api/backend.rst @@ -0,0 +1,90 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _backends: + +******** +Backends +******** + +The rocALUTION structure is embedded with the support for accelerator devices. It is recommended to use accelerators to decrease the computational time. +.. note:: Not all functions are ported and present on the accelerator backend. This limited functionality is natural, since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.). + +rocALUTION supports HIP-capable GPUs starting with ROCm 1.9. Due to its design, the library can be easily extended to support future accelerator technologies. Such an extension of the library will not affect the algorithms based on it. + +If a particular function is not implemented for the used accelerator, the library moves the object to the host and computes the routine there. In such cases, a warning message of level 2 is printed. For example, if the user wants to perform an ILUT factorization on the HIP backend which is currently unavailable, the library moves the object to the host, performs the routine there and prints the following warning message: + +:: + + *** warning: LocalMatrix::ILUTFactorize() is performed on the host + +Moving objects to and from the accelerator +========================================== + +All objects in rocALUTION can be moved to the accelerator and the host. + +.. doxygenfunction:: rocalution::BaseRocalution::MoveToAccelerator +.. doxygenfunction:: rocalution::BaseRocalution::MoveToHost + +.. code-block:: cpp + + LocalMatrix mat; + LocalVector vec1, vec2; + + // Perform matrix vector multiplication on the host + mat.Apply(vec1, &vec2); + + // Move data to the accelerator + mat.MoveToAccelerator(); + vec1.MoveToAccelerator(); + vec2.MoveToAccelerator(); + + // Perform matrix vector multiplication on the accelerator + mat.Apply(vec1, &vec2); + + // Move data to the host + mat.MoveToHost(); + vec1.MoveToHost(); + vec2.MoveToHost(); + +Asynchronous transfers +====================== + +The rocALUTION library also provides asynchronous transfer of data between host and HIP backend. + +.. 
doxygenfunction:: rocalution::BaseRocalution::MoveToAcceleratorAsync +.. doxygenfunction:: rocalution::BaseRocalution::MoveToHostAsync +.. doxygenfunction:: rocalution::BaseRocalution::Sync + +This can be done with :cpp:func:`rocalution::LocalVector::CopyFromAsync` and :cpp:func:`rocalution::LocalMatrix::CopyFromAsync` or with ``MoveToAcceleratorAsync()`` and ``MoveToHostAsync()``. These functions return immediately and perform the asynchronous transfer in background mode. The synchronization is done with ``Sync()``. + +When using the ``MoveToAcceleratorAsync()`` and ``MoveToHostAsync()`` functions, the object still points to its original location (i.e. host for calling ``MoveToAcceleratorAsync()`` and accelerator for ``MoveToHostAsync()``). The object switches to the new location after the ``Sync()`` function is called. + +.. note:: The objects should not be modified during an active asynchronous transfer to avoid the possibility of generating incorrect values after the synchronization. +.. note:: To use asynchronous transfers, enable the pinned memory allocation. Uncomment ``#define ROCALUTION_HIP_PINNED_MEMORY`` in ``src/utils/allocate_free.hpp``. + +Systems without accelerators +============================ + +rocALUTION provides full code compatibility on systems without accelerators. You can take the code from the GPU system, re-compile the same code on a machine without a GPU and it still provides the same results. Any calls to :cpp:func:`rocalution::BaseRocalution::MoveToAccelerator` and :cpp:func:`rocalution::BaseRocalution::MoveToHost` are ignored. + +Memory allocations +================== + +All data that is passed to and from rocALUTION uses the memory handling functions described in the code. By default, the library uses standard C++ ``new`` and ``delete`` functions for the host data. To change the default behavior, modify ``src/utils/allocate_free.cpp``. + +Allocation problems +------------------- + +If the allocation fails, the library reports an error and exits. To change this default behavior, modify ``src/utils/allocate_free.cpp``. + +Memory alignment +---------------- + +The library can also handle special memory alignment functions. This feature needs to be uncommented before the compilation process in ``src/utils/allocate_free.cpp``. + +Pinned memory allocation (HIP) +------------------------------ + +By default, the standard host memory allocation is realized using C++ ``new`` and ``delete``. For faster PCI-Express transfers on HIP backend, use pinned host memory. You can activate this by uncommenting the corresponding macro in ``src/utils/allocate_free.hpp``. diff --git a/docs/usermanual/precond.rst b/docs/api/precond.rst similarity index 68% rename from docs/usermanual/precond.rst rename to docs/api/precond.rst index e804a767..3e8b2b1b 100644 --- a/docs/usermanual/precond.rst +++ b/docs/api/precond.rst @@ -1,50 +1,65 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _preconditioners: + ############### Preconditioners ############### -In this chapter, all preconditioners are presented. All preconditioners support local operators. They can be used as a global preconditioner via block-jacobi scheme which works locally on each interior matrix. To provide fast application, all preconditioners require extra memory to keep the approximated operator. 
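Before moving on to the preconditioners below, here is a minimal usage sketch of the asynchronous transfer pattern described in the Backends section above. It is not taken from the rocALUTION sources; it only combines the documented ``MoveToAcceleratorAsync()``, ``Sync()`` and ``Apply()`` calls, and assumes that ``ValueType`` is defined and that pinned memory allocation is enabled as noted above.

.. code-block:: cpp

    LocalMatrix<ValueType> mat;
    LocalVector<ValueType> vec1, vec2;

    // ... fill mat, vec1 and vec2 on the host ...

    // Start the transfers; the objects still point to their host data
    mat.MoveToAcceleratorAsync();
    vec1.MoveToAcceleratorAsync();
    vec2.MoveToAcceleratorAsync();

    // Independent host work can overlap with the transfers here

    // Synchronize; only now do the objects switch to the accelerator
    mat.Sync();
    vec1.Sync();
    vec2.Sync();

    // Perform matrix vector multiplication on the accelerator
    mat.Apply(vec1, &vec2);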
+ +This document provides a category-wise listing of the preconditioners. All preconditioners support local operators. They can be used as a global preconditioner via block-jacobi scheme, which works locally on each interior matrix. To provide fast application, all preconditioners require extra memory to keep the approximated operator. .. doxygenclass:: rocalution::Preconditioner -Code Structure +Code structure ============== -The preconditioners provide a solution to the system :math:`Mz = r`, where either the solution :math:`z` is directly computed by the approximation scheme or it is iteratively obtained with :math:`z = 0` initial guess. -Jacobi Method +The preconditioners provide a solution to the system :math:`Mz = r`, where the solution :math:`z` is either directly computed by the approximation scheme or iteratively obtained with :math:`z = 0` initial guess. + +Jacobi method ============= + .. doxygenclass:: rocalution::Jacobi -.. note:: Damping parameter :math:`\omega` can be adjusted by :cpp:func:`rocalution::FixedPoint::SetRelaxation`. +.. note:: To adjust the damping parameter :math:`\omega`, use :cpp:func:`rocalution::FixedPoint::SetRelaxation`. + +(Symmetric) Gauss-Seidel or (S)SOR method +========================================== -(Symmetric) Gauss-Seidel / (S)SOR Method -======================================== .. doxygenclass:: rocalution::GS .. doxygenclass:: rocalution::SGS -.. note:: Relaxation parameter :math:`\omega` can be adjusted by :cpp:func:`rocalution::FixedPoint::SetRelaxation`. +.. note:: To adjust the relaxation parameter :math:`\omega`, use :cpp:func:`rocalution::FixedPoint::SetRelaxation`. -Incomplete Factorizations +Incomplete factorizations ========================= ILU --- + .. doxygenclass:: rocalution::ILU .. doxygenfunction:: rocalution::ILU::Set ILUT ---- + .. doxygenclass:: rocalution::ILUT .. doxygenfunction:: rocalution::ILUT::Set(double) .. doxygenfunction:: rocalution::ILUT::Set(double, int) IC --- +--- + .. doxygenclass:: rocalution::IC AI Chebyshev ============ + .. doxygenclass:: rocalution::AIChebyshev .. doxygenfunction:: rocalution::AIChebyshev::Set FSAI ==== + .. doxygenclass:: rocalution::FSAI .. doxygenfunction:: rocalution::FSAI::Set(int) .. doxygenfunction:: rocalution::FSAI::Set(const OperatorType&) @@ -52,55 +67,63 @@ FSAI SPAI ==== + .. doxygenclass:: rocalution::SPAI .. doxygenfunction:: rocalution::SPAI::SetPrecondMatrixFormat TNS === + .. doxygenclass:: rocalution::TNS .. doxygenfunction:: rocalution::TNS::Set .. doxygenfunction:: rocalution::TNS::SetPrecondMatrixFormat -MultiColored Preconditioners +MultiColored preconditioners ============================ + .. doxygenclass:: rocalution::MultiColored .. doxygenfunction:: rocalution::MultiColored::SetPrecondMatrixFormat .. doxygenfunction:: rocalution::MultiColored::SetDecomposition -MultiColored (Symmetric) Gauss-Seidel / (S)SOR +MultiColored (symmetric) Gauss-Seidel / (S)SOR ---------------------------------------------- + .. doxygenclass:: rocalution::MultiColoredGS .. doxygenclass:: rocalution::MultiColoredSGS .. doxygenfunction:: rocalution::MultiColoredSGS::SetRelaxation -.. note:: The preconditioner matrix format can be changed using :cpp:func:`rocalution::MultiColored::SetPrecondMatrixFormat`. +.. note:: To change the preconditioner matrix format, use :cpp:func:`rocalution::MultiColored::SetPrecondMatrixFormat`. -MultiColored Power(q)-pattern method ILU(p,q) +MultiColored power(q)-pattern method ILU(p,q) --------------------------------------------- + .. 
doxygenclass:: rocalution::MultiColoredILU .. doxygenfunction:: rocalution::MultiColoredILU::Set(int) .. doxygenfunction:: rocalution::MultiColoredILU::Set(int, int, bool) -.. note:: The preconditioner matrix format can be changed using :cpp:func:`rocalution::MultiColored::SetPrecondMatrixFormat`. +.. note:: To change the preconditioner matrix format, use :cpp:func:`rocalution::MultiColored::SetPrecondMatrixFormat`. -Multi-Elimination Incomplete LU +Multi-elimination incomplete LU =============================== + .. doxygenclass:: rocalution::MultiElimination .. doxygenfunction:: rocalution::MultiElimination::GetSizeDiagBlock .. doxygenfunction:: rocalution::MultiElimination::GetLevel .. doxygenfunction:: rocalution::MultiElimination::Set .. doxygenfunction:: rocalution::MultiElimination::SetPrecondMatrixFormat -Diagonal Preconditioner for Saddle-Point Problems +Diagonal preconditioner for saddle-point problems ================================================= + .. doxygenclass:: rocalution::DiagJacobiSaddlePointPrecond .. doxygenfunction:: rocalution::DiagJacobiSaddlePointPrecond::Set -(Restricted) Additive Schwarz Preconditioner +(Restricted) Additive Schwarz preconditioner ============================================ + .. doxygenclass:: rocalution::AS .. doxygenfunction:: rocalution::AS::Set .. doxygenclass:: rocalution::RAS -The overlapped area is shown in :numref:`AS`. +See the overlapped area in the figure below: .. _AS: .. figure:: ../data/AS.png @@ -109,12 +132,13 @@ The overlapped area is shown in :numref:`AS`. Example of a 4 block-decomposed matrix - Additive Schwarz with overlapping preconditioner (left) and Restricted Additive Schwarz preconditioner (right). -Block-Jacobi (MPI) Preconditioner +Block-Jacobi (MPI) preconditioner ================================= + .. doxygenclass:: rocalution::BlockJacobi .. doxygenfunction:: rocalution::BlockJacobi::Set -The Block-Jacobi (MPI) preconditioner is shown in :numref:`BJ`. +See the Block-Jacobi (MPI) preconditioner in the figure below: .. _BJ: .. figure:: ../data/BJ.png @@ -123,8 +147,9 @@ The Block-Jacobi (MPI) preconditioner is shown in :numref:`BJ`. Example of a 4 block-decomposed matrix - Block-Jacobi preconditioner. -Block Preconditioner +Block preconditioner ==================== + .. doxygenclass:: rocalution::BlockPreconditioner .. doxygenfunction:: rocalution::BlockPreconditioner::Set .. doxygenfunction:: rocalution::BlockPreconditioner::SetDiagonalSolver @@ -133,8 +158,8 @@ Block Preconditioner .. doxygenfunction:: rocalution::BlockPreconditioner::SetPermutation -Variable Preconditioner +Variable preconditioner ======================= + .. doxygenclass:: rocalution::VariablePreconditioner .. doxygenfunction:: rocalution::VariablePreconditioner::SetPreconditioner - diff --git a/docs/usermanual/solvers.rst b/docs/api/solvers.rst similarity index 78% rename from docs/usermanual/solvers.rst rename to docs/api/solvers.rst index 05522945..098355b5 100644 --- a/docs/usermanual/solvers.rst +++ b/docs/api/solvers.rst @@ -1,12 +1,20 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _solver-class: + ******* Solvers ******* -Code Structure +This document provides a category-wise listing of the solver APIs along with the information required to use them. + +Code structure ============== .. 
doxygenclass:: rocalution::Solver -It provides an interface for +It provides an interface for: .. doxygenfunction:: rocalution::Solver::SetOperator .. doxygenfunction:: rocalution::Solver::Build @@ -17,11 +25,11 @@ It provides an interface for .. doxygenfunction:: rocalution::Solver::MoveToHost .. doxygenfunction:: rocalution::Solver::MoveToAccelerator -Iterative Linear Solvers +Iterative linear solvers ======================== .. doxygenclass:: rocalution::IterativeLinearSolver -It provides an interface for +It provides an interface for: .. doxygenfunction:: rocalution::IterativeLinearSolver::Init(double, double, double, int) .. doxygenfunction:: rocalution::IterativeLinearSolver::Init(double, double, double, int, int) @@ -36,13 +44,13 @@ It provides an interface for .. doxygenfunction:: rocalution::IterativeLinearSolver::GetAmaxResidualIndex .. doxygenfunction:: rocalution::IterativeLinearSolver::GetSolverStatus -Building and Solving Phase +Building and solving phase ========================== -Each iterative solver consists of a building step and a solving step. During the building step all necessary auxiliary data is allocated and the preconditioner is constructed. After that, the user can call the solving procedure, the solving step can be called several times. +Each iterative solver consists of a building step and a solving step. During the building step all necessary auxiliary data is allocated and the preconditioner is constructed. You can now call the solving procedure, which can be called several times. -When the initial matrix associated with the solver is on the accelerator, the solver will try to build everything on the accelerator. However, some preconditioners and solvers (such as FSAI and AMG) need to be constructed on the host before they can be transferred to the accelerator. If the initial matrix is on the host and we want to run the solver on the accelerator then we need to move the solver to the accelerator as well as the matrix, the right-hand-side and the solution vector. +When the initial matrix associated with the solver is on the accelerator, the solver tries to build everything on the accelerator. However, some preconditioners and solvers (such as FSAI and AMG) must be constructed on the host before being transferred to the accelerator. If the initial matrix is on the host and you want to run the solver on the accelerator, then you need to move the solver to the accelerator, matrix, right-hand side, and solution vector. -.. note:: If you have a preconditioner associate with the solver, it will be moved automatically to the accelerator when you move the solver. +.. note:: If you have a preconditioner associated with the solver, it is moved automatically to the accelerator when you move the solver. .. code-block:: cpp @@ -94,24 +102,27 @@ When the initial matrix associated with the solver is on the accelerator, the so ls.Solve(rhs, &x); -Clear Function and Destructor +Clear function and destructor ============================= + The :cpp:func:`rocalution::Solver::Clear` function clears all the data which is in the solver, including the associated preconditioner. Thus, the solver is not anymore associated with this preconditioner. .. note:: The preconditioner is not deleted (via destructor), only a :cpp:func:`rocalution::Preconditioner::Clear` is called. -.. note:: When the destructor of the solver class is called, it automatically calls the *Clear()* function. 
Be careful, when declaring your solver and preconditioner in different places - we highly recommend to manually call the *Clear()* function of the solver and not to rely on the destructor of the solver. +.. note:: When the destructor of the solver class is called, it automatically calls the *Clear()* function. Be careful, when declaring your solver and preconditioner in different places - we highly recommend to manually call the *Clear()* function of the solver and not rely on the destructor of the solver. -Numerical Update +Numerical update ================ -Some preconditioners require two phases in the their construction: an algebraic (e.g. compute a pattern or structure) and a numerical (compute the actual values) phase. In cases, where the structure of the input matrix is a constant (e.g. Newton-like methods) it is not necessary to fully re-construct the preconditioner. In this case, the user can apply a numerical update to the current preconditioner and pass the new operator with :cpp:func:`rocalution::Solver::ReBuildNumeric`. If the preconditioner/solver does not support the numerical update, then a full :cpp:func:`rocalution::Solver::Clear` and :cpp:func:`rocalution::Solver::Build` will be performed. -Fixed-Point Iteration +Some preconditioners require two phases in the their construction: an algebraic (e.g. compute a pattern or structure) and a numerical (compute the actual values) phase. In cases, where the structure of the input matrix is a constant (e.g. Newton-like methods), it is not necessary to fully reconstruct the preconditioner. In this case, the user can apply a numerical update to the current preconditioner and pass the new operator with :cpp:func:`rocalution::Solver::ReBuildNumeric`. If the preconditioner/solver does not support the numerical update, then a full :cpp:func:`rocalution::Solver::Clear` and :cpp:func:`rocalution::Solver::Build` is performed. + +Fixed-Point iteration ===================== + .. doxygenclass:: rocalution::FixedPoint .. doxygenfunction:: rocalution::FixedPoint::SetRelaxation -Krylov Subspace Solvers +Krylov subspace solvers ======================= CG @@ -154,26 +165,31 @@ BiCGStab(l) .. doxygenclass:: rocalution::BiCGStabl .. doxygenfunction:: rocalution::BiCGStabl::SetOrder -Chebyshev Iteration Scheme +Chebyshev iteration scheme ========================== + .. doxygenclass:: rocalution::Chebyshev -Mixed-Precision Defect Correction Scheme +Mixed-precision defect correction scheme ======================================== + .. doxygenclass:: rocalution::MixedPrecisionDC -MultiGrid Solvers +MultiGrid solvers ================= -The library provides algebraic multigrid as well as a skeleton for geometric multigrid methods. The BaseMultigrid class itself is not constructing the data for the method. It contains the solution procedure for V, W and K-cycles. The AMG has two different versions for Local (non-MPI) and for Global (MPI) type of computations. + +The library provides algebraic multigrid and a skeleton for geometric multigrid methods. The ``BaseMultigrid`` class itself doesn't construct data for the method. It contains the solution procedure for V, W and K-cycles. The AMG has two different versions for Local (non-MPI) and for Global (MPI) type of computations. .. doxygenclass:: rocalution::BaseMultiGrid -Geometric MultiGrid +Geometric multiGrid ------------------- + .. doxygenclass:: rocalution::MultiGrid -Algebraic MultiGrid +Algebraic multiGrid ------------------- + .. doxygenclass:: rocalution::BaseAMG .. 
doxygenfunction:: rocalution::BaseAMG::BuildHierarchy .. doxygenfunction:: rocalution::BaseAMG::BuildSmoothers @@ -184,31 +200,35 @@ Algebraic MultiGrid .. doxygenfunction:: rocalution::BaseAMG::SetOperatorFormat .. doxygenfunction:: rocalution::BaseAMG::GetNumLevels -Unsmoothed Aggregation AMG +Unsmoothed aggregation AMG ========================== + .. doxygenclass:: rocalution::UAAMG .. doxygenfunction:: rocalution::UAAMG::SetCouplingStrength .. doxygenfunction:: rocalution::UAAMG::SetOverInterp -Smoothed Aggregation AMG +Smoothed aggregation AMG ======================== + .. doxygenclass:: rocalution::SAAMG .. doxygenfunction:: rocalution::SAAMG::SetCouplingStrength .. doxygenfunction:: rocalution::SAAMG::SetInterpRelax -Ruge-Stueben AMG +Ruge-stueben AMG ================ + .. doxygenclass:: rocalution::RugeStuebenAMG .. doxygenfunction:: rocalution::RugeStuebenAMG::SetCouplingStrength Pairwise AMG ============ + .. doxygenclass:: rocalution::PairwiseAMG .. doxygenfunction:: rocalution::PairwiseAMG::SetBeta .. doxygenfunction:: rocalution::PairwiseAMG::SetOrdering .. doxygenfunction:: rocalution::PairwiseAMG::SetCoarseningFactor -Direct Linear Solvers +Direct linear solvers ===================== .. doxygenclass:: rocalution::DirectLinearSolver .. doxygenclass:: rocalution::LU diff --git a/docs/design/clients.rst b/docs/design/clients.rst index 841f7143..9f099dcf 100644 --- a/docs/design/clients.rst +++ b/docs/design/clients.rst @@ -1,58 +1,65 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _clients: + ******* Clients ******* + rocALUTION clients host a variety of different examples as well as a unit test package. -For detailed instructions on how to build rocALUTION with clients, see :ref:`rocalution_building`. +For detailed instructions on how to build rocALUTION with clients, see :ref:`linux-installation` or :ref:`windows-installation`. Examples ======== The examples collection offers different possible set-ups of solvers and preconditioners. 
-The following tables gives a short overview on the different examples: +The following tables provide a quick overview of various examples: + +===================== ==== +Example Description +===================== ==== +``amg`` Algebraic Multigrid solver (smoothed aggregation scheme, GS smoothing) +``as-precond`` GMRES solver with Additive Schwarz preconditioning +``async`` Asynchronous rocALUTION object transfer +``benchmark`` Benchmarking important sparse functions +``bicgstab`` BiCGStab solver with multicolored Gauss-Seidel preconditioning +``block-precond`` GMRES solver with blockwise multicolored ILU preconditioning +``cg-amg`` CG solver with Algebraic Multigrid (smoothed aggregation scheme) preconditioning +``cg`` CG solver with Jacobi preconditioning +``cmk`` CG solver with ILU preconditioning using Cuthill McKee ordering +``direct`` Matrix inversion +``fgmres`` Flexible GMRES solver with multicolored Gauss-Seidel preconditioning +``fixed-point`` Fixed-Point iteration scheme using Jacobi relaxation +``gmres`` GMRES solver with multicolored Gauss-Seidel preconditioning +``idr`` Induced Dimension Reduction solver with Jacobi preconditioning +``key`` Sparse matrix unique key computation +``me-preconditioner`` CG solver with multi-elimination preconditioning +``mixed-precision`` Mixed-precision CG solver with multicolored ILU preconditioning +``power-method`` CG solver using Chebyshev preconditioning and power method for eigenvalue approximation +``simple-spmv`` Sparse Matrix Vector multiplication +``sp-precond`` BiCGStab solver with multicolored ILU preconditioning for saddle point problems +``stencil`` CG solver using stencil as operator +``tns`` CG solver with Truncated Neumann Series preconditioning +``var-precond`` FGMRES solver with variable preconditioning +===================== ==== ================= ==== -Example Description +Example (MPI) Description ================= ==== -amg Algebraic Multigrid solver (smoothed aggregation scheme, GS smoothing) -as-precond GMRES solver with Additive Schwarz preconditioning -async Asynchronous rocALUTION object transfer -benchmark Benchmarking important sparse functions -bicgstab BiCGStab solver with multicolored Gauss-Seidel preconditioning -block-precond GMRES solver with blockwise multicolored ILU preconditioning -cg-amg CG solver with Algebraic Multigrid (smoothed aggregation scheme) preconditioning -cg CG solver with Jacobi preconditioning -cmk CG solver with ILU preconditioning using Cuthill McKee ordering -direct Matrix inversion -fgmres Flexible GMRES solver with multicolored Gauss-Seidel preconditioning -fixed-point Fixed-Point iteration scheme using Jacobi relaxation -gmres GMRES solver with multicolored Gauss-Seidel preconditioning -idr Induced Dimension Reduction solver with Jacobi preconditioning -key Sparse matrix unique key computation -me-preconditioner CG solver with multi-elimination preconditioning -mixed-precision Mixed-precision CG solver with multicolored ILU preconditioning -power-method CG solver using Chebyshev preconditioning and power method for eigenvalue approximation -simple-spmv Sparse Matrix Vector multiplication -sp-precond BiCGStab solver with multicolored ILU preconditioning for saddle point problems -stencil CG solver using stencil as operator -tns CG solver with Truncated Neumann Series preconditioning -var-precond FGMRES solver with variable preconditioning +``benchmark_mpi`` Benchmarking important sparse functions +``bicgstab_mpi`` BiCGStab solver with multicolored Gauss-Seidel preconditioning 
+``cg-amg_mpi`` CG solver with Algebraic Multigrid (pairwise aggregation scheme) preconditioning +``cg_mpi`` CG solver with Jacobi preconditioning +``fcg_mpi`` Flexible CG solver with ILU preconditioning +``fgmres_mpi`` Flexible GMRES solver with SParse Approximate Inverse preconditioning +``global-io_mpi`` File I/O with CG solver and Factorized Sparse Approximate Inverse preconditioning +``idr_mpi`` IDR solver with Factorized Sparse Approximate Inverse preconditioning +``qmrcgstab_mpi`` QMRCGStab solver with ILU-T preconditioning ================= ==== -============= ==== -Example (MPI) Description -============= ==== -benchmark_mpi Benchmarking important sparse functions -bicgstab_mpi BiCGStab solver with multicolored Gauss-Seidel preconditioning -cg-amg_mpi CG solver with Algebraic Multigrid (pairwise aggregation scheme) preconditioning -cg_mpi CG solver with Jacobi preconditioning -fcg_mpi Flexible CG solver with ILU preconditioning -fgmres_mpi Flexible GMRES solver with SParse Approximate Inverse preconditioning -global-io_mpi File I/O with CG solver and Factorized Sparse Approximate Inverse preconditioning -idr_mpi IDR solver with Factorized Sparse Approximate Inverse preconditioning -qmrcgstab_mpi QMRCGStab solver with ILU-T preconditioning -============= ==== - Unit Tests ========== -Multiple unit tests are available to test for bad arguments, invalid parameters and solver and preconditioner functionality. -The unit tests are based on google test. -The tests cover a variety of different solver, preconditioning and matrix format combinations and can be performed on all available backends. +There are multiple unit tests available to test for bad arguments, invalid parameters, and solver and preconditioner functionality. +These unit tests are based on google test. +The tests cover a variety of solver, preconditioning, and matrix format combinations and can be performed on all available backends. diff --git a/docs/design/design.rst b/docs/design/design.rst index 435190ba..ae38fbf3 100644 --- a/docs/design/design.rst +++ b/docs/design/design.rst @@ -1,28 +1,35 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _design-philosophy: + ********************* -Design and Philosophy +Design and philosophy ********************* + rocALUTION is written in C++ and HIP. -The main idea of the rocALUTION objects is that they are separated from the actual hardware specification. -Once you declare a matrix, a vector or a solver they are initially allocated on the host (CPU). -Then, every object can be moved to a selected accelerator by a simple function call. -The whole execution mechanism is based on run-time type information (RTTI), which allows you to select where and how you want to perform the operations at run time. -This is in contrast to the template-based libraries, which need this information at compile time. +The rocALUTION objects are designed to be separate from the actual hardware specification. +Once you declare a matrix, a vector, or a solver, these rocALUTION objects are initially allocated on the host (CPU). +Then, every object can be moved to a selected accelerator using a simple function call. +The whole execution mechanism is based on the Run-Time Type Information (RTTI), which allows you to select the location and method for performing the operations at run-time. 
+This is in contrast to the template-based libraries that require this information at compile-time. -The philosophy of the library is to abstract the hardware-specific functions and routines from the actual program, that describes the algorithm. -It is hard and almost impossible for most of the large simulation software based on sparse computation, to adapt and port their implementation in order to use every new technology. -On the other hand, the new high performance accelerators and devices have the capability to decrease the computational time significantly in many critical parts. +The philosophy of the library is to abstract the hardware-specific functions and routines from the actual program that describes the algorithm. +It is difficult and almost impossible for most of the large simulation softwares based on sparse computation to adapt and port their implementation to suit every new technology. +On the other hand, the new high performance accelerators and devices can decrease the computational time significantly in many critical parts. -This abstraction layer of the hardware specific routines is the core of the rocALUTION design. +This abstraction layer of the hardware-specific routines is the core of the rocALUTION design. It is built to explore fine-grained level of parallelism suited for multi/many-core devices. -This is in contrast to most of the parallel sparse libraries available which are mainly based on domain decomposition techniques. -Thus, the design of the iterative solvers the preconditioners is very different. -Another cornerstone of rocALUTION is the native support of accelerators - the memory allocation, transfers and specific hardware functions are handled internally in the library. +This is in contrast to most of the parallel sparse libraries that are based mainly on domain decomposition techniques. +That's why the design of the iterative solvers and preconditioners is very different. +Another cornerstone of rocALUTION is the native support for accelerators where the memory allocation, transfers, and specific hardware functions are handled internally in the library. -rocALUTION helps you to use accelerator technologies but does not force you to use them. -Even if you offload your algorithms and solvers to the accelerator device, the same source code can be compiled and executed in a system without any accelerator. +rocALUTION doesn't make the use of accelerator technologies mandatory. +Even if you offload your algorithms and solvers on the accelerator device, the same source code can be compiled and executed on a system without an accelerator. Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms to check if a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. -This guarantees that your code will run with any accelerator, regardless of the available functionality for it. +This ensures that your code runs on any accelerator, regardless of the available functionality for it. diff --git a/docs/design/designdoc.rst b/docs/design/designdoc.rst deleted file mode 100644 index c1614ca7..00000000 --- a/docs/design/designdoc.rst +++ /dev/null @@ -1,15 +0,0 @@ -.. _design_document: - -#################### -Design Documentation -#################### - -.. 
toctree:: - :maxdepth: 3 - :caption: Contents: - - design - orga - guides - functable - clients diff --git a/docs/design/functable.rst b/docs/design/functable.rst index d1221b80..4e9f5e80 100644 --- a/docs/design/functable.rst +++ b/docs/design/functable.rst @@ -1,15 +1,22 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _functionality-table: + ******************* -Functionality Table +Functionality table ******************* -The following tables give an overview whether a rocALUTION routine is implemented on host backend, accelerator backend, or both. +The following tables list the rocALUTION routines along with the information about the implementation location i.e. host backend, accelerator backend, or both. LocalMatrix and LocalVector classes =================================== + All matrix operations (except SpMV) require a CSR matrix. -.. note:: If the input matrix is not a CSR matrix, an internal conversion will be performed to CSR format, followed by a back conversion to the previous format after the operation. - In this case, a warning message on verbosity level 2 will be printed. +.. note:: If the input matrix is not a CSR matrix, an internal conversion is performed to CSR format, followed by a back conversion to the previous format after the operation. + In this case, a warning message on verbosity level 2 is printed. ==================================================================================== =============================================================================== ======== ======= **LocalMatrix function** **Comment** **Host** **HIP** @@ -147,7 +154,7 @@ All matrix operations (except SpMV) require a CSR matrix. :cpp:func:`Power ` Compute vector power Yes Yes ====================================================================================== ===================================================================== ======== ======= -Solver and Preconditioner classes +Solver and preconditioner classes ================================= .. note:: The building phase of the iterative solver also depends on the selected preconditioner. diff --git a/docs/design/guides.rst b/docs/design/guides.rst index ded12304..9b4a8a37 100644 --- a/docs/design/guides.rst +++ b/docs/design/guides.rst @@ -1,26 +1,36 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _functionality-extension: + ********************************** -Functionality Extension Guidelines +Functionality extension guidelines ********************************** -The main purpose of this chapter is to give an overview of different ways to implement user-specific routines, solvers or preconditioners to the rocALUTION library package. + +This document provides information about the different ways to implement user-specific routines, solvers, or preconditioners to the rocALUTION library package. Additional features can be added in multiple ways. -Additional solver and preconditioner functionality that uses already implemented backend functionality will perform well on accelerator devices without the need for expert GPU programming knowledge. -Also, users that are not interested in using accelerators will not be confronted with HIP and GPU related programming tasks to add additional functionality. 
+Additional solver and preconditioner functionality that uses the existing backend functionality performs well on accelerator devices without the need for expert GPU programming knowledge. +Also, those not interested in using accelerators are not required to perform HIP and GPU-related programming tasks to add additional functionality. In the following sections, different levels of functionality enhancements are illustrated. These examples can be used as guidelines to extend rocALUTION step by step with your own routines. -Please note, that user added routines can also be added to the main GitHub repository using pull requests. +Please note that user-added routines can also be added to the main GitHub repository using pull requests. + +``LocalMatrix`` functionality extension +======================================== + +This section demonstrates how to extend the :cpp:class:`LocalMatrix ` class with an additional routine. +The routine supports both Host and Accelerator backend. +Furthermore, the routine requires the matrix to be in CSR format. +Here are the steps to extend the :cpp:class:`LocalMatrix ` functionality: -LocalMatrix Functionlity Extension ================================== -In this example, the :cpp:class:`LocalMatrix ` class is extended by an additional routine. -The routine shall support both, Host and Accelerator backend. -Furthermore, the routine requires the matrix to be in CSR format. +1. API enhancement -------------------- -API Enhancement ---------------- -To make the new routine available by the API, we first need to modify the :cpp:class:`LocalMatrix ` class. -The corresponding header file `local_matrix.hpp` is located in `src/base/`. -The new routines can be added as public member function, e.g. +To make the new routine available through the API, modify the :cpp:class:`LocalMatrix ` class. +The corresponding header file ``local_matrix.hpp`` is located in ``src/base/``. +The new routines can be added as public member functions as shown below: .. code-block:: cpp @@ -33,9 +43,9 @@ The new routines can be added as public member function, e.g. virtual void ApplyAdd(const LocalVector& in, ... -For the implementation of the new API function, it is important to know where this functionality will be available. -To add support for any backend and matrix format, format conversions are required, if `MyNewFunctionality()` is only supported for CSR matrices. -This will be subject to the API function implementation: +For the implementation of the new API function, it is important to know where this functionality will be available. +To add support for any backend and matrix format, format conversions are required if ``MyNewFunctionality()`` is only supported for CSR matrices. +This is subject to the API function implementation: .. code-block:: cpp @@ -115,16 +125,17 @@ This will be subject to the API function implementation: #endif } -Similarly, host-only functions can be implemented. -In this case, initial data explicitly need to be moved to the host backend by the API implementation. +Similarly, you can implement host-only functions. +In this case, initial data explicitly needs to be moved to the host backend using the API implementation. -The next step is the implementation of the actual functionality in the :cpp:class:`BaseMatrix ` class. +The next step is to implement the actual functionality in the :cpp:class:`BaseMatrix ` class.
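The host-only case just described can be summarized with the following hypothetical sketch. It is not rocALUTION source code: the routine name ``MyHostOnlyFunctionality()`` and the internal members ``matrix_`` and ``matrix_accel_`` are assumptions used purely to illustrate the pattern (explicitly move the object to the host, call the host implementation, then restore the original location).

.. code-block:: cpp

    // Hypothetical sketch only; names are illustrative, not actual rocALUTION code
    template <typename ValueType>
    void LocalMatrix<ValueType>::MyHostOnlyFunctionality(void)
    {
        // Remember whether the object currently resides on the accelerator
        bool was_on_accelerator = (this->matrix_ == this->matrix_accel_);

        if(was_on_accelerator == true)
        {
            // Only a host implementation exists, so move the data to the host
            this->MoveToHost();
        }

        // Dispatch to the base matrix implementation (host backend)
        if(this->matrix_->MyHostOnlyFunctionality() == false)
        {
            // ... perform the CSR conversion fallback as shown above ...
        }

        if(was_on_accelerator == true)
        {
            // Restore the original location of the object
            this->MoveToAccelerator();
        }
    }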
-Enhancement of the BaseMatrix class ----------------------------------- -To make the new routine available in the base class, we first need to modify the :cpp:class:`BaseMatrix ` class. -The corresponding header file `base_matrix.hpp` is located in `src/base/`. -The new routines can be added as public member function, e.g. +2. Enhancement of the ``BaseMatrix`` class --------------------------------------------- + +To make the new routine available in the base class, first modify the :cpp:class:`BaseMatrix ` class. +The corresponding header file ``base_matrix.hpp`` is located in ``src/base/``. +The new routines can be added as public member functions, e.g. .. code-block:: cpp @@ -137,8 +148,8 @@ The new routines can be added as public member function, e.g. /// Perform LU factorization ... -We do not implement `MyNewFunctionality()` purely virtual, as we do not supply an implementation for all base classes. -We decided to implement it only for CSR format, and thus need to return an error flag, such that the :cpp:class:`LocalMatrix ` class is aware of the failure and can convert it to CSR. +We don't implement ``MyNewFunctionality()`` as a pure virtual function, as we don't supply an implementation for all base classes. +We decided to implement it only for CSR format and hence need to return an error flag, so that the :cpp:class:`LocalMatrix ` class is aware of the failure and can convert it to CSR. .. code-block:: cpp @@ -148,11 +159,12 @@ We decided to implement it only for CSR format, and thus need to return an error return false; } -Platform-specific Host Implementation ````````````````````````````````````` -So far, our new function will always fail, as there is no backend implementation available yet. -To satisfy the rocALUTION host backup philosophy, we need to make sure that there is always a host implementation available. -This host implementation need to be placed in `src/base/host/host_matrix_csr.cpp` as we decided to make it available for CSR format. +3. Platform-specific host implementation ------------------------------------------- + +Without a backend implementation, the new function always fails. +To satisfy the rocALUTION host backup philosophy, there must always be a host implementation available. +Place the host implementation in ``src/base/host/host_matrix_csr.cpp`` as we decided to make it available for CSR format. .. code-block:: cpp @@ -171,7 +183,7 @@ This host implementation need to be placed in `src/base/host/host_matrix_csr.cpp { // Place some asserts to verify sanity of input data - // Our algorithm works only for squared matrices + // Our algorithm works only for square matrices assert(this->nrow_ == this->ncol_); assert(this->nnz_ > 0); @@ -208,11 +220,12 @@ This host implementation need to be placed in `src/base/host/host_matrix_csr.cpp return true; } -Platform-specific HIP Implementation ```````````````````````````````````` -We can now add an additional implementation for the HIP backend, using HIP programming framework. -This will make our algorithm available on accelerators and rocALUTION will not switch to the host backend on function calls anymore. -The HIP implementation needs to be added to `src/base/hip/hip_matrix_csr.cpp` in this case. +4. Platform-specific HIP implementation ------------------------------------------ + +You can now add an additional implementation for the HIP backend using the HIP programming framework.
+This is required to make your algorithm available on accelerators so that rocALUTION doesn't need to switch to the host backend on function calls anymore. +Add the HIP implementation to ``src/base/hip/hip_matrix_csr.cpp`` in this case. .. code-block:: cpp @@ -231,7 +244,7 @@ The HIP implementation needs to be added to `src/base/hip/hip_matrix_csr.cpp` in { // Place some asserts to verify sanity of input data - // Our algorithm works only for squared matrices + // Our algorithm works only for square matrices assert(this->nrow_ == this->ncol_); assert(this->nnz_ > 0); @@ -251,40 +264,43 @@ The HIP implementation needs to be added to `src/base/hip/hip_matrix_csr.cpp` in return true; } -The corresponding HIP kernel should be placed in `src/base/hip/hip_kernels_csr.hpp`. +Place the corresponding HIP kernel in ``src/base/hip/hip_kernels_csr.hpp``. -Adding a Solver +Adding a solver =============== -In this example, a new solver shall be added to rocALUTION. - -API Enhancement ---------------- -First, the API for the new solver must be defined. -In this example, a new :cpp:class:`IterativeLinearSolver ` is added. -To achieve this, the :cpp:class:`CG ` is a good template. -Thus, we first copy `src/solvers/krylov/cg.hpp` to `src/solvers/krylov/mysolver.hpp` and `src/solvers/krylov.cg.cpp` to `src/solvers/krylov/mysolver.cpp` (assuming we add a krylov subspace solvers). - -Next, modify the `cg.hpp` and `cg.cpp` to your needs (e.g. change the solver name from `CG` to `MySolver`). -Each of the virtual functions in the class need an implementation. - -- **MySolver()**: The constructor of the new solver class. -- **~MySolver()**: The destructor of the new solver class. It should call the `Clear()` function. -- **void Print(void) const**: This function should print some informations about the solver. -- **void Build(void)**: This function creates all required structures of the solver, e.g. allocates memory and sets the backend of temporary objects. -- **void BuildMoveToAcceleratorAsync(void)**: This function should moves all solver related objects asynchronously to the accelerator device. -- **void Sync(void)**: This function should synchronize all solver related objects. -- **void ReBuildNumeric(void)**: This function should re-build the solver only numerically. -- **void Clear(void)**: This function should clean up all solver relevant structures that have been created using `Build()`. -- **void SolveNonPrecond_(const VectorType& rhs, VectorType* x)**: This function should perform the solving phase `Ax=y` without the use of a preconditioner. -- **void SolvePrecond_(const VectorType& rhs, VectorType* x)**: This function should perform the solving phase `Ax=y` with the use of a preconditioner. -- **void PrintStart_(void) const**: This protected function is called upton solver start. -- **void PrintEnd_(void) const**: This protected function is called when the solver ends. -- **void MoveToHostLocalData_(void)**: This protected function should move all local solver objects to the host. -- **void MoveToAcceleratorLocalData_(void)**: This protected function should move all local solver objects to the accelerator. - -Of course, additional member functions that are solver specific, can be introduced. - -Then, to make the new solver visible, we have to add it to the `src/rocalution.hpp` header: + +This section demonstrates how to add a new solver to rocALUTION. Here are the steps: + +1. Define the API for the new solver + +As an example, we add a new :cpp:class:`IterativeLinearSolver `.
+To achieve this, we use :cpp:class:`CG ` as a template. +Thus, we first copy ``src/solvers/krylov/cg.hpp`` to ``src/solvers/krylov/mysolver.hpp`` and ``src/solvers/krylov.cg.cpp`` to ``src/solvers/krylov/mysolver.cpp`` (assuming we add a krylov subspace solvers). + +2. Modify the `cg.hpp` and `cg.cpp` as per your requirement (e.g. change the solver name from `CG` to `MySolver`) + +Implement each of the following virtual functions present in the class. Follow the implementation details given below: + +- ``MySolver()``: The constructor of the new solver class. +- ``~MySolver()``: The destructor of the new solver class. It calls the ``Clear()`` function. +- ``void Print(void) const``: Prints some informations about the solver. +- ``void Build(void)``: Creates all required structures of the solver, e.g. allocates memory and sets the backend of temporary objects. +- ``void BuildMoveToAcceleratorAsync(void)``: Moves all solver-related objects asynchronously to the accelerator device. +- ``void Sync(void)``: Synchronizes all solver related objects. +- ``void ReBuildNumeric(void)``: Rebuilds the solver only numerically. +- ``void Clear(void)``: Cleans up all solver-relevant structures that have been created using ``Build()``. +- ``void SolveNonPrecond_(const VectorType& rhs, VectorType* x)``: Performs the solving phase ``Ax=y`` without the use of a preconditioner. +- ``void SolvePrecond_(const VectorType& rhs, VectorType* x)``: Performs the solving phase ``Ax=y`` with the use of a preconditioner. +- ``void PrintStart_(void) const``: Protected function. Called when the solver starts. +- ``void PrintEnd_(void) const``: Protected function. Called when the solver ends. +- ``void MoveToHostLocalData_(void)``: Protected function. Moves all local solver objects to the host. +- ``void MoveToAcceleratorLocalData_(void)``: Protected function. Moves all local solver objects to the accelerator. + +You can also introduce any additional solver-specific member functions. + +3. Make the new solver visible + +To make the new solver visible, add it to the ``src/rocalution.hpp`` header: .. code-block:: cpp @@ -294,7 +310,9 @@ Then, to make the new solver visible, we have to add it to the `src/rocalution.h #include "solvers/krylov/cr.hpp" ... -Finally, the new solver must be added to the CMake compilation list, found in `src/solvers/CMakeLists.txt`: +4. Add the new solver to the CMake compilation list + +The CMake compilation list is found in ``src/solvers/CMakeLists.txt``: .. code-block:: cpp diff --git a/docs/design/orga.rst b/docs/design/orga.rst index d3333c30..81d66ec3 100644 --- a/docs/design/orga.rst +++ b/docs/design/orga.rst @@ -1,39 +1,44 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _source-code-organization: + ******************************** -Library Source Code Organization +Library source code organization ******************************** -Library Source Code Organization -================================ -The rocALUTION library is split into three major parts: +The rocALUTION library is split into three major directories: -- The `src/base/` directory contains all source code that is built on top of the :cpp:class:`BaseRocalution ` object as well as the backend structure. -- `src/solvers/` contains all solvers, preconditioners and its control classes. 
-- In `src/utils/` memory (de)allocation, logging, communication, timing and math helper functions are placed. +- ``src/base/``: Contains all source code that is built on top of the :cpp:class:`BaseRocalution ` object as well as the backend structure. +- ``src/solvers/``: Contains all solvers, preconditioners, and its control classes. +- ``src/utils/``: Contains memory (de)allocation, logging, communication, timing, and math helper functions. -The `src/base/` directory -------------------------- +``src/base/`` directory +---------------------------- + +The source files in the ``src/base/`` directory are listed below. Backend Manager ``````````````` The support of accelerator devices is embedded in the structure of rocALUTION. The primary goal is to use this technology whenever possible to decrease the computational time. -Each technology has its own backend implementation, dealing with platform specific initialization, synchronization, reservation, etc. functionality. -Currently available backends are for CPU (naive, OpenMP, MPI) and GPU (HIP). +Each technology has its own backend implementation, dealing with platform-specific functionalities such as initialization, synchronization, reservation, etc. +The backends are currently available for CPU (naive, OpenMP, MPI) and GPU (HIP). .. note:: Not all functions are ported and present on the accelerator backend. - This limited functionality is natural, since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.). + This limited functionality is natural, since all operations can't be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.). The Operator and Vector classes ``````````````````````````````` -The :cpp:class:`Operator ` and :cpp:class:`Vector ` classes and its derived local and global classes, are the classes available by the rocALUTION API. -While granting the user access to all relevant functionality, all hardware relevant implementation details are hidden. +The :cpp:class:`Operator ` and :cpp:class:`Vector ` classes and their derived local and global classes are the classes available through the rocALUTION API. +While granting access to all relevant functionalities, all hardware-relevant implementation details are hidden. Those linear operators and vectors are the main objects in rocALUTION. -They can be moved to an accelerator at run time. +They can be moved to an accelerator at run-time. -The linear operators are defined as local or global matrices (i.e. on a single node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). -The only template parameter of the operators and vectors is the data type (ValueType). -:numref:`operatorsd` gives an overview of supported operators and vectors. +The linear operators are defined as local or global metrices (i.e. on a single node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). +The only template parameter of the operators and vectors is the data type (ValueType). The figure below provides an overview of supported operators and vectors. .. _operatorsd: .. figure:: ../data/operators.png @@ -42,60 +47,60 @@ The only template parameter of the operators and vectors is the data type (Value Operator and vector classes. -Each of the objects contain a local copy of the hardware descriptor created by the :cpp:func:`init_rocalution ` function. 
-Additionally, each local object that is derived from an operator or vector, contains a pointer to a `Base`-class, a `Host`-class and an `Accelerator`-class of same type (e.g. a :cpp:class:`LocalMatrix ` contains pointers to a :cpp:class:`BaseMatrix `, :cpp:class:`HostMatrix ` and :cpp:class:`AcceleratorMatrix `). -The `Base`-class pointer will always point towards either the `Host`-class or the `Accelerator`-class pointer, dependend on the runtime decision of the local object. -`Base`-classes and their derivatives are further explained in :ref:`rocalution_base_classes`. +Each object contains a local copy of the hardware descriptor created by the :cpp:func:`init_rocalution ` function. +Additionally, each local object that is derived from an operator or vector, contains a pointer to a `Base`-class, a `Host`-class and an `Accelerator`-class of same type (e.g. a :cpp:class:`LocalMatrix ` contains pointers to :cpp:class:`BaseMatrix `, :cpp:class:`HostMatrix ` and :cpp:class:`AcceleratorMatrix `). +The ``Base`` class pointer always points either towards the ``Host`` class or the ``Accelerator`` class pointer depending on the runtime decision of the local object. +``Base`` classes and their derivatives are further explained in :ref:`rocalution_base_classes`. -Furthermore, each global object, derived from an operator or vector, embeds two `Local`-classes of same type to store the interior and ghost part of the global object (e.g. a :cpp:class:`GlobalVector ` contains two :cpp:class:`LocalVector `). -For more details on distributed data structures, see the user manual. +Furthermore, each global object derived from an operator or vector embeds two ``Local`` classes of the same type to store the interior and ghost part of the global object (e.g. a :cpp:class:`GlobalVector ` contains two :cpp:class:`LocalVector `). +For more details on distributed data structures, see the API reference section. .. _rocalution_base_classes: The BaseMatrix and BaseVector classes ````````````````````````````````````` -The `data` is an object, pointing to the BaseMatrix class. -The pointing is coming from either a HostMatrix or an AcceleratorMatrix. -The AcceleratorMatrix is created by an object with an implementation in the backend and a matrix format. -Switching between host and accelerator matrices is performed in the LocalMatrix class. -The LocalVector is organized in the same way. +The ``data`` is an object pointing to the ``BaseMatrix`` class from either a ``HostMatrix`` or an ``AcceleratorMatrix``. +The ``AcceleratorMatrix`` is created by an object with an implementation in the backend and a matrix format. +Switching between host and accelerator metrices is performed in the ``LocalMatrix`` class. +The ``LocalVector`` is organized in the same way. -Each matrix format has its own class for the host and for the accelerator backend. -All matrix classes are derived from the BaseMatrix, which provides the base interface for computation as well as for data accessing. +Each matrix format has its own class for the host and the accelerator backend. +All matrix classes are derived from the ``BaseMatrix``, which provides the base interface for computation as well as for accessing the data. -Each local object contains a pointer to a `Base`-class object. -While the `Base`-class is mainly pure virtual, their derivatives implement all platform specific functionality. +Each local object contains a pointer to a ``Base`` class object. 
+While the ``Base`` classes are mostly pure virtual, their derivatives implement all platform-specific functionalities. Each of them is coupled to a rocALUTION backend descriptor. -While the :cpp:class:`HostMatrix `, :cpp:class:`HostStencil ` and :cpp:class:`HostVector ` classes implements all host functionality, :cpp:class:`AcceleratorMatrix `, :cpp:class:`AcceleratorStencil ` and :cpp:class:`AcceleratorVector ` contain accelerator related device code. -Each of the backend specializations are located in a different directory, e.g. `src/base/host` for host related classes and `src/base/hip` for accelerator / HIP related classes. +While the :cpp:class:`HostMatrix `, :cpp:class:`HostStencil ` and :cpp:class:`HostVector ` classes implement all host functionalities, :cpp:class:`AcceleratorMatrix `, :cpp:class:`AcceleratorStencil ` and :cpp:class:`AcceleratorVector ` contain accelerator-related device code. +Each backend specialization is located in a different directory, e.g. ``src/base/host`` for host-related classes and ``src/base/hip`` for accelerator/HIP-related classes. ParallelManager ``````````````` The parallel manager class handles the communication and the mapping of the global operators. -Each global operator and vector need to be initialized with a valid parallel manager in order to perform any operation. +Each global operator and vector needs to be initialized with a valid parallel manager to perform any operation. For many distributed simulations, the underlying operator is already distributed. -This information need to be passed to the parallel manager. -All communication functionality for the implementation of global algorithms is available in the rocALUTION communicator in `src/utils/communicator.hpp`. -For more details on distributed data structures, see the user manual. +This information must be passed to the parallel manager. +All communication-related functionalities for the implementation of global algorithms are available in the rocALUTION communicator in ``src/utils/communicator.hpp``. +For more details on distributed data structures, see the API reference section. -The `src/solvers/` directory +``src/solvers/`` directory ---------------------------- -The :cpp:class:`Solver ` and its derived classes can be found in `src/solvers`. -The directory structure is further split into the sub-classes :cpp:class:`DirectLinearSolver ` in `src/solvers/direct`, :cpp:class:`IterativeLinearSolver ` in `src/solvers/krylov`, :cpp:class:`BaseMultiGrid ` in `src/solvers/multigrid` and :cpp:class:`Preconditioner ` in `src/solvers/preconditioners`. -Each of the solver is using an :cpp:class:`Operator `, :cpp:class:`Vector ` and data type as template parameters to solve a linear system of equations. + +The :cpp:class:`Solver ` and its derived classes can be found in ``src/solvers``. +The directory structure is further split into the sub-classes :cpp:class:`DirectLinearSolver ` in ``src/solvers/direct``, :cpp:class:`IterativeLinearSolver ` in ``src/solvers/krylov``, :cpp:class:`BaseMultiGrid ` in ``src/solvers/multigrid`` and :cpp:class:`Preconditioner ` in ``src/solvers/preconditioners``. +Each solver uses an :cpp:class:`Operator `, :cpp:class:`Vector ` and data type as template parameters to solve a linear system of equations. The actual solver algorithm is implemented by the :cpp:class:`Operator ` and :cpp:class:`Vector ` functionality. Most of the solvers can be performed on linear operators, e.g. :cpp:class:`LocalMatrix `, :cpp:class:`LocalStencil ` and :cpp:class:`GlobalMatrix ` - i.e.
the solvers can be performed locally (on a shared memory system) or in a distributed manner (on a cluster) via MPI. All solvers and preconditioners need three template parameters - Operators, Vectors and Scalar type. -The Solver class is purely virtual and provides an interface for +The Solver class is purely virtual and provides an interface for: -- :cpp:func:`SetOperator ` to set the operator, i.e. the user can pass the matrix here. +- :cpp:func:`SetOperator ` to set the operator, which allows you to pass the matrix. - :cpp:func:`Build ` to build the solver (including preconditioners, sub-solvers, etc.). - The user need to specify the operator first before building the solver. + You must specify the operator before building the solver. - :cpp:func:`Solve ` to solve the sparse linear system. - The user need to pass a right-hand side and a solution / initial guess vector. + You need to pass a right-hand side and a solution / initial guess vector. - :cpp:func:`Print ` to show solver information. -- :cpp:func:`ReBuildNumeric ` to only re-build the solver numerically (if possible). +- :cpp:func:`ReBuildNumeric ` to only rebuild the solver numerically (if possible). - :cpp:func:`MoveToHost ` and :cpp:func:`MoveToAccelerator ` to offload the solver (including preconditioners and sub-solvers) to the host / accelerator. .. _solvers: @@ -105,18 +110,18 @@ The Solver class is purely virtual and provides an interface for Solver and preconditioner classes. -The `src/utils/` directory +``src/utils/`` directory -------------------------- -In the `src/utils` directory, all commonly used host (de)allocation, timing, math, communication and logging functionality is gathered. +In the ``src/utils`` directory, all commonly used host (de)allocation, timing, math, communication, and logging functionalities are gathered. -Furthermore, the rocALUTION `GlobalType`, which is the indexing type for global, distributed structures, can be adjusted in `src/utils/types.hpp`. +Furthermore, the rocALUTION ``GlobalType``, which is the indexing type for global and distributed structures, can be adjusted in ``src/utils/types.hpp``. By default, rocALUTION uses 64-bit wide global indexing. .. note:: It is not recommended to switch to 32-bit global indexing. -In `src/utils/def.hpp` +In ``src/utils/def.hpp``: -- verbosity level `VERBOSE_LEVEL` can be adjusted, see :ref:`rocalution_verbose`, -- debug mode `DEBUG_MODE` can be enabled, see :ref:`rocalution_debug`, -- MPI logging `LOG_MPI_RANK` can be modified, see :ref:`rocalution_logging`, -- and object tracking `OBJ_TRACKING_OFF` can be enabled, see :ref:`rocalution_obj_tracking`. +- Verbosity level ``VERBOSE_LEVEL`` can be adjusted, see :ref:`rocalution_verbose`. +- Debug mode ``DEBUG_MODE`` can be enabled, see :ref:`rocalution_debug`. +- MPI logging ``LOG_MPI_RANK`` can be modified, see :ref:`rocalution_logging`. +- Object tracking ``OBJ_TRACKING_OFF`` can be enabled, see :ref:`rocalution_obj_tracking`. diff --git a/docs/index.rst b/docs/index.rst index 3fff0a6a..220e09c9 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,15 +1,48 @@ -######################## -rocALUTION Documentation -######################## +..
meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool -rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains, targeting modern CPU and GPU platforms. -Based on C++ and HIP, it provides a portable, generic and flexible design that allows seamless integration with other scientific software packages. +.. _index: -In the following, three separate chapters are available: +=========================== +rocALUTION documentation +=========================== - * Installation Guide (either `Linux `__ or `Windows `__): Describes how to install and configure the rocALUTION library; designed - to get users up and running quickly with the library - * :ref:`user_manual`: This is the manual of rocALUTION. It can be seen as a starting guide for new users but also a reference book for more experienced users. - * :ref:`design_document`: The Design Document is targeted to advanced users / developers that want to understand, modify or extend the functionality of the rocALUTION library. - To embed rocALUTION into your project, it is not required to read the Design Document. - * :ref:`api`: This is a list of API functions provided by rocALUTION. +rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains. To learn more, see :ref:`what-is-rocalution`. + +You can access rocALUTION code on our `GitHub repository `_. + +Our documentation is structured as follows: + +.. grid:: 2 + :gutter: 3 + + .. grid-item-card:: Install + + * :ref:`linux-installation` + * :ref:`windows-installation` + * :ref:`supported-targets` + + .. grid-item-card:: API reference + + * :ref:`basics` + * :ref:`single-node` + * :ref:`multi-node` + * :ref:`solver-class` + * :ref:`preconditioners` + * :ref:`backends` + * :ref:`api` + * :ref:`remarks` + + .. grid-item-card:: Contribution + + * :ref:`design-philosophy` + * :ref:`source-code-organization` + * :ref:`functionality-extension` + * :ref:`functionality-table` + * :ref:`clients` + +To contribute to the documentation, refer to +`Contributing to ROCm `_. + +You can find licensing information on the `Licensing `_ page.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 3db34202..334c6cf0 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -4,9 +4,30 @@ root: index subtrees: - numbered: False entries: - - file: design/designdoc - - file: usermanual/usermanual - - file: api + - file: what-is-rocalution + - caption: Install + entries: + - file: usermanual/linux-installation + - file: usermanual/windows-installation + - file: usermanual/targets + - caption: API reference + entries: + - file: usermanual/basics + - file: usermanual/singlenode + - file: usermanual/multinode + - file: api/solvers + - file: api/precond + - file: api/backend + - file: api/api + - file: usermanual/remarks + - caption: Contribution + entries: + - file: design/design + - file: design/orga + - file: design/guides + - file: design/functable + - file: design/clients - caption: About entries: - file: license + diff --git a/docs/usermanual/Windows_Install_Guide.rst b/docs/usermanual/Windows_Install_Guide.rst deleted file mode 100644 index c0775bd0..00000000 --- a/docs/usermanual/Windows_Install_Guide.rst +++ /dev/null @@ -1,166 +0,0 @@ -===================================== -Installation and Building for Windows -===================================== - -------------- -Prerequisites -------------- - -- An AMD HIP SDK-enabled platform. You can find more information in the `ROCm documentation `_. -- rocALUTION is supported on the same Windows versions and toolchains that are supported by the HIP SDK. -- As the AMD HIP SDK is new and quickly evolving it will have more up to date information regarding the SDK's internal contents. Thus it may overrule statements found in this section on installing and building for Windows. - - ----------------------------- -Installing Prebuilt Packages ----------------------------- - -rocALUTION can be installed on Windows 11 or Windows 10 using the AMD HIP SDK installer. - -The simplest way to use rocALUTION in your code would be using CMake for which you would add the SDK installation location to your -`CMAKE_PREFIX_PATH`. Note you need to use quotes as the path contains a space, e.g., - -:: - - -DCMAKE_PREFIX_PATH="C:\Program Files\AMD\ROCm\5.5" - - -in your CMake configure step and then in your CMakeLists.txt use - -:: - - find_package(rocalution) - - target_link_libraries( your_exe PRIVATE roc::rocalution ) - -The rocalution.hpp header file must be included in the user code to make calls -into rocALUTION, and the rocALUTION import library and dynamic link library will become respective link-time and run-time -dependencies for the user application. - -Once installed, find rocalution.hpp in the HIP SDK `\\include\\rocalution` -directory. Only use these two installed files when needed in user code. - ----------------------------------- -Building and Installing rocALUTION ----------------------------------- - -Building from source is not necessary, as rocALUTION can be used after installing the pre-built packages as described above. -If desired, the following instructions can be used to build rocALUTION from source. 
- -Requirements -^^^^^^^^^^^^ -- `git `_ -- `CMake `_ 3.5 or later -- `AMD ROCm `_ 2.9 or later (optional, for HIP support) -- `rocSPARSE `_ (optional, for HIP support) -- `rocBLAS `_ (optional, for HIP support) -- `rocPRIM `_ (optional, for HIP support) -- `OpenMP `_ (optional, for OpenMP support) -- `MPI `_ (optional, for multi-node / multi-GPU support) -- `googletest `_ (optional, for clients) - - -Download rocALUTION -^^^^^^^^^^^^^^^^^^^ - -The rocALUTION source code, which is the same as for the ROCm linux distributions, is available at the `rocALUTION github page `_. -The version of the ROCm HIP SDK may be shown in the path of default installation, but -you can run the HIP SDK compiler to report the verison from the bin/ folder with: - -:: - - hipcc --version - -The HIP version has major, minor, and patch fields, possibly followed by a build specific identifier. For example, HIP version could be 5.4.22880-135e1ab4; -this corresponds to major = 5, minor = 4, patch = 22880, build identifier 135e1ab4. -There are GitHub branches at the rocALUTION site with names release/rocm-rel-major.minor where major and minor are the same as in the HIP version. -For example for you can use the following to download rocALUTION: - -:: - - git clone -b release/rocm-rel-x.y https://github.com/ROCmSoftwarePlatform/rocALUTION.git - cd rocALUTION - -Replace x.y in the above command with the version of HIP SDK installed on your machine. For example, if you have HIP 5.5 installed, then use -b release/rocm-rel-5.5 -You can can add the SDK tools to your path with an entry like: - -:: - - %HIP_PATH%\bin - -Building -^^^^^^^^ - -Below are steps to build using the `rmake.py` script. The user can build either: - -* library - -* library + client - -You only need (library) if you call rocALUTION from your code and only want the library built. -The client contains testing and benchmark tools. rmake.py will print to the screen the full cmake command being used to configure rocALUTION based on your rmake command line options. -This full cmake command can be used in your own build scripts if you want to bypass the python helper script for a fixed set of build options. - - -Build Library -^^^^^^^^^^^^^ - -Common uses of rmake.py to build (library) are -in the table below: - -.. tabularcolumns:: - |\X{1}{4}|\X{3}{4}| - -+--------------------+--------------------------+ -| Command | Description | -+====================+==========================+ -| ``./rmake.py -h`` | Help information. | -+--------------------+--------------------------+ -| ``./rmake.py`` | Build library. | -+--------------------+--------------------------+ -| ``./rmake.py -i`` | Build library, then | -| | build and install | -| | rocALUTION package. | -| | If you want to keep | -| | rocALUTION in your local | -| | tree, you do not | -| | need the -i flag. 
| -+--------------------+--------------------------+ - - -Build Library + Client -^^^^^^^^^^^^^^^^^^^^^^ - -Some client executables (.exe) are listed in the table below: - -====================== ================================================== -executable name description -====================== ================================================== -rocalution-test runs Google Tests to test the library -rocalution-bench executable to benchmark or test functions -./cg lap_25.mtx execute conjugate gradient example - (must download mtx matrix file you wish to use) -====================== ================================================== - -Common uses of rmake.py to build (library + client) are -in the table below: - -.. tabularcolumns:: - |\X{1}{4}|\X{3}{4}| - -+------------------------+--------------------------+ -| Command | Description | -+========================+==========================+ -| ``./rmake.py -h`` | Help information. | -+------------------------+--------------------------+ -| ``./rmake.py -c`` | Build library and client | -| | in your local directory. | -+------------------------+--------------------------+ -| ``./rmake.py -ic`` | Build and install | -| | rocALUTION package, and | -| | build the client. | -| | If you want to keep | -| | rocALUTION in your local | -| | directory, you do not | -| | need the -i flag. | -+------------------------+--------------------------+ diff --git a/docs/usermanual/backend.rst b/docs/usermanual/backend.rst deleted file mode 100644 index aa4ab913..00000000 --- a/docs/usermanual/backend.rst +++ /dev/null @@ -1,77 +0,0 @@ -******** -Backends -******** -The support of accelerator devices is embedded in the structure of rocALUTION. The primary goal is to use this technology whenever possible to decrease the computational time. -.. note:: Not all functions are ported and present on the accelerator backend. This limited functionality is natural, since not all operations can be performed efficiently on the accelerators (e.g. sequential algorithms, I/O from the file system, etc.). - -Currently, rocALUTION supports HIP capable GPUs starting with ROCm 1.9. Due to its design, the library can be easily extended to support future accelerator technologies. Such an extension of the library will not reflect the algorithms which are based on it. - -If a particular function is not implemented for the used accelerator, the library will move the object to the host and compute the routine there. In this case a warning message of level 2 will be printed. For example, if the user wants to perform an ILUT factorization on the HIP backend which is currently not available, the library will move the object to the host, perform the routine there and print the following warning message - -:: - - *** warning: LocalMatrix::ILUTFactorize() is performed on the host - -Moving Objects To and From the Accelerator -========================================== -All objects in rocALUTION can be moved to the accelerator and to the host. - -.. doxygenfunction:: rocalution::BaseRocalution::MoveToAccelerator -.. doxygenfunction:: rocalution::BaseRocalution::MoveToHost - -.. 
code-block:: cpp - - LocalMatrix mat; - LocalVector vec1, vec2; - - // Perform matrix vector multiplication on the host - mat.Apply(vec1, &vec2); - - // Move data to the accelerator - mat.MoveToAccelerator(); - vec1.MoveToAccelerator(); - vec2.MoveToAccelerator(); - - // Perform matrix vector multiplication on the accelerator - mat.Apply(vec1, &vec2); - - // Move data to the host - mat.MoveToHost(); - vec1.MoveToHost(); - vec2.MoveToHost(); - -Asynchronous Transfers -====================== -The rocALUTION library also provides asynchronous transfers of data between host and HIP backend. - -.. doxygenfunction:: rocalution::BaseRocalution::MoveToAcceleratorAsync -.. doxygenfunction:: rocalution::BaseRocalution::MoveToHostAsync -.. doxygenfunction:: rocalution::BaseRocalution::Sync - -This can be done with :cpp:func:`rocalution::LocalVector::CopyFromAsync` and :cpp:func:`rocalution::LocalMatrix::CopyFromAsync` or with `MoveToAcceleratorAsync()` and `MoveToHostAsync()`. These functions return immediately and perform the asynchronous transfer in background mode. The synchronization is done with `Sync()`. - -When using the `MoveToAcceleratorAsync()` and `MoveToHostAsync()` functions, the object will still point to its original location (i.e. host for calling `MoveToAcceleratorAsync()` and accelerator for `MoveToHostAsync()`). The object will switch to the new location after the `Sync()` function is called. - -.. note:: The objects should not be modified during an active asynchronous transfer. However, if this happens, the values after the synchronization might be wrong. -.. note:: To use the asynchronous transfers, you need to enable the pinned memory allocation. Uncomment `#define ROCALUTION_HIP_PINNED_MEMORY` in `src/utils/allocate_free.hpp`. - -Systems without Accelerators -============================ -rocALUTION provides full code compatibility on systems without accelerators, the user can take the code from the GPU system, re-compile the same code on a machine without a GPU and it will provide the same results. Any calls to :cpp:func:`rocalution::BaseRocalution::MoveToAccelerator` and :cpp:func:`rocalution::BaseRocalution::MoveToHost` will be ignored. - -Memory Allocations -================== -All data which is passed to and from rocALUTION is using the memory handling functions described in the code. By default, the library uses standard C++ *new* and *delete* functions for the host data. This can be changed by modifying `src/utils/allocate_free.cpp`. - -Allocation Problems -------------------- -If the allocation fails, the library will report an error and exits. If the user requires a special treatment, it has to be placed in `src/utils/allocate_free.cpp`. - -Memory Alignment ----------------- -The library can also handle special memory alignment functions. This feature need to be uncommented before the compilation process in `src/utils/allocate_free.cpp`. - -Pinned Memory Allocation (HIP) ------------------------------- -By default, the standard host memory allocation is realized by C++ *new* and *delete*. For faster PCI-Express transfers on HIP backend, the user can also use pinned host memory. This can be activated by uncommenting the corresponding macro in `src/utils/allocate_free.hpp`. - diff --git a/docs/usermanual/basics.rst b/docs/usermanual/basics.rst index a1abc1f4..f8179cd8 100644 --- a/docs/usermanual/basics.rst +++ b/docs/usermanual/basics.rst @@ -1,16 +1,25 @@ +.. 
meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _basics: + ****** Basics ****** -Operators and Vectors +This document covers the basic information about rocALUTION APIs and their usage. + +Operators and vectors ===================== + The main objects in rocALUTION are linear operators and vectors. -All objects can be moved to an accelerator at run time. -The linear operators are defined as local or global matrices (i.e. on a single node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). -The only template parameter of the operators and vectors is the data type (ValueType). -The operator data type could be float, double, complex float or complex double, while the vector data type can be int, float, double, complex float or complex double (int is used mainly for the permutation vectors). -In the current version, cross ValueType object operations are not supported. :numref:`operators` gives an overview of supported operators and vectors. -Further details are also given in the :ref:`design_document`. +All objects can be moved to an accelerator at run-time. +The linear operators are defined as local or global matrices (i.e. on a single-node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). +The only template parameter of the operators and vectors is the data type (``ValueType``). +The operator data type could be float, double, complex float, or complex double, while the vector data type can be int, float, double, complex float or complex double (int is used mainly for the permutation vectors). +In the current version, cross ``ValueType`` object operations are not supported. The following figure gives an overview of supported operators and vectors. +For more details, refer to the :ref:`design-philosophy`. .. _operators: .. figure:: ../data/operators.png @@ -19,76 +28,86 @@ Further details are also given in the :ref:`design_document`. Operator and vector classes. -Each of the objects contain a local copy of the hardware descriptor created by the :cpp:func:`rocalution::init_rocalution` function. This allows the user to modify it according to his needs and to obtain two or more objects with different hardware specifications (e.g. different amount of OpenMP threads, HIP block sizes, etc.). +Each object contains a local copy of the hardware descriptor created by the :cpp:func:`rocalution::init_rocalution` function. This allows you to modify it according to your needs and to obtain two or more objects with different hardware specifications (e.g. different numbers of OpenMP threads, HIP block sizes, etc.). -Local Operators and Vectors +Local operators and vectors --------------------------- -By Local Operators and Vectors we refer to Local Matrices and Stencils and to Local Vectors. By Local we mean the fact that they stay on a single system. The system can contain several CPUs via UMA or NUMA memory system, it can also contain an accelerator. + +The local operators and vectors correspond to the local matrices and stencils, and local vectors. The term "local" means that they stay on a single system. A system can contain several CPUs via UMA or NUMA memory system, as well as an accelerator. .. doxygenclass:: rocalution::LocalMatrix .. doxygenclass:: rocalution::LocalStencil ..
doxygenclass:: rocalution::LocalVector -Global Operators and Vectors +Global operators and vectors ---------------------------- -By Global Operators and Vectors we refer to Global Matrix and to Global Vectors. By Global we mean the fact they can stay on a single or multiple nodes in a network. For this type of computation, the communication is based on MPI. + +Global operators and vectors correspond to the global matrix and global vectors. The term "global" means that they can stay on single or multiple nodes in a network. For this type of computation, the communication is based on MPI. .. doxygenclass:: rocalution::GlobalMatrix .. doxygenclass:: rocalution::GlobalVector -Backend Descriptor and User Control +Backend descriptor and user control =================================== + Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms to check if a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. -This guarantees that your code will run (maybe not in the most efficient way) with any accelerator regardless of the available functionality for it. +This ensures that the application runs (maybe not in the most efficient way) with any accelerator regardless of the availability of the required functionality for it. Initialization of rocALUTION ---------------------------- -The body of a rocALUTION code is very simple, it should contain the header file and the namespace of the library. -The program must contain an initialization call to :cpp:func:`init_rocalution ` which will check and allocate the hardware and a finalizing call to :cpp:func:`stop_rocalution ` which will release the allocated hardware. + +The body of a rocALUTION code should simply contain the header file and the namespace of the library. +The program must contain an initialization call to :cpp:func:`init_rocalution ` that checks and allocates the hardware and a finalizing call to :cpp:func:`stop_rocalution ` that releases the allocated hardware. .. doxygenfunction:: rocalution::init_rocalution .. doxygenfunction:: rocalution::stop_rocalution -Thread-core Mapping +Thread-core mapping ------------------- -The number of threads which rocALUTION will use can be modified by the function :cpp:func:`set_omp_threads_rocalution ` or by the global OpenMP environment variable (for Unix-like OS this is `OMP_NUM_THREADS`). + +The number of threads used by rocALUTION can be modified by the function :cpp:func:`set_omp_threads_rocalution ` or by the global OpenMP environment variable (for Unix-like OS this is ``OMP_NUM_THREADS``). During the initialization phase, the library provides affinity thread-core mapping: -- If the number of cores (including SMT cores) is greater or equal than two times the number of threads, then all the threads can occupy every second core ID (e.g. 0,2,4,...). +- If the number of cores (including SMT cores) is greater than or equal to twice the number of threads, then all the threads can occupy every second core ID (e.g. 0,2,4,...). This is to avoid having two threads working on the same physical core, when SMT is enabled. -- If the number of threads is less or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0,1,2,3,...).
-- If non of the above criteria is matched, then the default thread-core mapping is used (typically set by the operating system). +- If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0,1,2,3,...). +- If none of the above criteria are matched, then the default thread-core mapping is used (typically set by the operating system). .. note:: The thread-core mapping is available for Unix-like operating systems only. -.. note:: The user can disable the thread affinity by :cpp:func:`set_omp_affinity_rocalution `, before initializing the library. +.. note:: The user can disable the thread affinity with :cpp:func:`set_omp_affinity_rocalution ` before initializing the library. -OpenMP Threshold Size +OpenMP threshold size --------------------- -Whenever working on a small problem, OpenMP host backend might be slightly slower than using no OpenMP. + +When working on a small problem, the OpenMP host backend might be slightly slower than using no OpenMP. This is mainly attributed to the small amount of work, which every thread should perform and the large overhead of forking/joining threads. -This can be avoid by the OpenMP threshold size parameter in rocALUTION. -The default threshold is set to 10.000, which means that all matrices under (and equal to) this size will use only one thread (disregarding the number of OpenMP threads set in the system). -The threshold can be modified with :cpp:func:`set_omp_threshold_rocalution `. +This can be avoided by the OpenMP threshold size parameter in rocALUTION. +The default threshold is set to 10,000, which means that all matrices under (and equal to) this size use only one thread (irrespective of the number of OpenMP threads set in the system). +To modify the threshold, use :cpp:func:`set_omp_threshold_rocalution `. -Accelerator Selection +Accelerator selection --------------------- -The accelerator device id that is supposed to be used for the computation can be selected by the user by :cpp:func:`set_device_rocalution `. -Disable the Accelerator +To select the accelerator device id to be used for the computation, use :cpp:func:`set_device_rocalution `. + +Disable the accelerator ----------------------- -Furthermore, the accelerator can be disabled without having to re-compile the library by calling :cpp:func:`disable_accelerator_rocalution `. -Backend Information +To disable the accelerator without having to re-compile the library, use :cpp:func:`disable_accelerator_rocalution `. + +Backend information ------------------- -Detailed information about the current backend / accelerator in use as well as the available accelerators can be printed by :cpp:func:`info_rocalution `. -MPI and Multi-Accelerators +To print the detailed information about the current backend / accelerator in use as well as the available accelerators, use :cpp:func:`info_rocalution `. + +MPI and multi-accelerators -------------------------- -When initializing the library with MPI, the user need to pass the rank of the MPI process as well as the number of accelerators available on each node. -Basically, this way the user can specify the mapping of MPI process and accelerators - the allocated accelerator will be `rank % num_dev_per_node`. -Thus, the user can run two MPI processes on systems with two accelerators by specifying the number of devices to 2, as illustrated in the example code below.
+ +When initializing the library with MPI, you need to pass the rank of the MPI process as well as the number of accelerators available on each node. +Basically, this way you can specify the mapping of MPI processes and accelerators - the allocated accelerator is ``rank % num_dev_per_node``. +Thus, you can run two MPI processes on systems with two accelerators by setting the number of devices to 2, as illustrated in the example code below. .. code-block:: cpp @@ -121,54 +140,60 @@ Thus, the user can run two MPI processes on systems with two accelerators by spe .. _rocalution_obj_tracking: -Automatic Object Tracking +Automatic object tracking ========================= + rocALUTION supports automatic object tracking. After the initialization of the library, all objects created by the user application can be tracked. Once :cpp:func:`stop_rocalution ` is called, all memory from tracked objects gets deallocated. -This will avoid memory leaks when the objects are allocated but not freed. -The user can enable or disable the tracking by editing `src/utils/def.hpp`. +This avoids memory leaks when the objects are allocated but not freed. +The user can enable or disable the tracking by editing ``src/utils/def.hpp``. By default, automatic object tracking is disabled. .. _rocalution_verbose: -Verbose Output +Verbose output ============== + rocALUTION provides different levels of output messages. -The `VERBOSE_LEVEL` can be modified in `src/utils/def.hpp` before the compilation of the library. -By setting a higher level, the user will obtain more detailed information about the internal calls and data transfers to and from the accelerators. -By default, `VERBOSE_LEVEL` is set to 2. +The ``VERBOSE_LEVEL`` can be modified in ``src/utils/def.hpp`` before the compilation of the library. +By setting a higher level, you can obtain more detailed information about the internal calls and data transfers to and from the accelerators. +By default, the ``VERBOSE_LEVEL`` is set to 2. .. _rocalution_logging: -Verbose Output and MPI +Verbose output and MPI ====================== -To prevent all MPI processes from printing information to `stdout`, the default configuration is that only `RANK 0` outputs information. -The user can change the `RANK` or allow all processes to print setting `LOG_MPI_RANK` to 1 in `src/utils/def.hpp`. + +To prevent all MPI processes from printing information to ``stdout``, the default configuration allows only ``RANK 0`` to output information. +You can change the ``RANK`` or allow all processes to print by setting ``LOG_MPI_RANK`` to 1 in ``src/utils/def.hpp``. If file logging is enabled, all ranks write into the corresponding log files. .. _rocalution_debug: -Debug Output +Debug output ============ -Debug output will print almost every detail in the program, including object constructor / destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers and preconditioners. -The flag `DEBUG_MODE` can be set in `src/utils/def.hpp`. -When enabled, additional `assert()s` are being checked during the computation. -This might decrease performance of some operations significantly. -File Logging +Debug output prints almost every detail in the program, including object constructor/destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers, and preconditioners. +The flag ``DEBUG_MODE`` can be set in ``src/utils/def.hpp``.
+When enabled, additional ``assert()s`` are checked during the computation. +This might significantly reduce the performance of some operations. + +File logging ============ -rocALUTION trace file logging can be enabled by setting the environment variable `ROCALUTION_LAYER` to 1. -rocALUTION will then log each rocALUTION function call including object constructor / destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers and preconditioners. -The log file will be placed in the working directory. -The log file naming convention is `rocalution-rank--.log`. -By default, the environment variable `ROCALUTION_LAYER` is unset, and logging is disabled. + +To enable rocALUTION trace file logging, set the environment variable ``ROCALUTION_LAYER`` to 1. +rocALUTION then logs each rocALUTION function call including object constructor/destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers, and preconditioners. +The log file is placed in the working directory. +The log file naming convention is ``rocalution-rank--.log``. +By default, the environment variable ``ROCALUTION_LAYER`` is unset and logging is disabled. .. note:: Performance might degrade when logging is enabled. Versions ======== -For checking the rocALUTION version in an application, pre-defined macros can be used: + +For checking the rocALUTION version in an application, use pre-defined macros: .. code-block:: cpp @@ -181,4 +206,4 @@ For checking the rocALUTION version in an application, pre-defined macros can be #define __ROCALUTION_VER // version -The final `__ROCALUTION_VER` holds the version number as `10000 * major + 100 * minor + patch`, as defined in `src/base/version.hpp.in`. +The final ``__ROCALUTION_VER`` holds the version number as ``10000 * major + 100 * minor + patch``, as defined in ``src/base/version.hpp.in``. diff --git a/docs/usermanual/install.rst b/docs/usermanual/install.rst deleted file mode 100644 index 7d75186b..00000000 --- a/docs/usermanual/install.rst +++ /dev/null @@ -1,12 +0,0 @@ -.. _rocalution_building: - -*********************** -Building and Installing -*********************** - -.. toctree:: - :maxdepth: 3 - :caption: Contents: - - Linux_Install_Guide - Windows_Install_Guide diff --git a/docs/usermanual/Linux_Install_Guide.rst b/docs/usermanual/linux-installation.rst similarity index 52% rename from docs/usermanual/Linux_Install_Guide.rst rename to docs/usermanual/linux-installation.rst index 83d964d1..55953e01 100644 --- a/docs/usermanual/Linux_Install_Guide.rst +++ b/docs/usermanual/linux-installation.rst @@ -1,25 +1,32 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _linux-installation: + =================================== -Installation and Building for Linux +Installation on Linux =================================== +This document provides information required to install and configure rocALUTION on Linux. + ------------- Prerequisites ------------- -- A ROCm enabled platform. `ROCm Documentation `_ has more information on - supported GPUs, Linux distributions, and Windows SKUs. It also has information on how to install ROCm. +A ROCm enabled platform. For information on supported GPUs, Linux distributions, ROCm installation, and Windows SKUs, refer to `ROCm Documentation `_. 
----------------------------- Installing pre-built packages ----------------------------- -rocALUTION can be installed from `AMD ROCm repository `_. +You can install rocALUTION from `AMD ROCm repository `_. The repository hosts the single-node, accelerator enabled version of the library. -If a different setup is required, e.g. multi-node support, rocALUTION needs to be built from source, see :ref:`rocalution_build_from_source`. +If a different setup is required, e.g. multi-node support, build :ref:`rocALUTION from source `. For detailed instructions on how to set up ROCm on different platforms, see the `AMD ROCm Platform Installation Guide for Linux `_. -rocALUTION has the following run-time dependencies +rocALUTION has the following run-time dependencies: - `AMD ROCm `_ 2.9 or later (optional, for HIP support) - `rocSPARSE `_ (optional, for HIP support) @@ -37,7 +44,7 @@ Building from GitHub repository Requirements ^^^^^^^^^^^^ -To build rocALUTION from source, the following compile-time and run-time dependencies must be met +To build rocALUTION from source, ensure that the following compile-time and run-time dependencies are met: - `git `_ - `CMake `_ 3.5 or later @@ -59,49 +66,45 @@ Download the master branch using: $ git clone -b master https://github.com/ROCmSoftwarePlatform/rocALUTION.git $ cd rocALUTION -Below are steps to build different packages of the library, including dependencies and clients. -It is recommended to install rocALUTION using the `install.sh` script. +Below are the steps to build different packages of the library, including dependencies and clients. +It is recommended to install rocALUTION using the ``install.sh`` script. Using `install.sh` script to build rocALUTION with dependencies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The following table lists common uses of `install.sh` to build dependencies + library. Accelerator support via HIP and OpenMP will be enabled by default, whereas MPI is disabled. - -.. tabularcolumns:: - |\X{1}{6}|\X{5}{6}| - -========================== ==== -Command Description -========================== ==== -`./install.sh -h` Print help information. -`./install.sh -d` Build dependencies and library in your local directory. The `-d` flag only needs to be used once. For subsequent invocations of `install.sh` it is not necessary to rebuild the dependencies. -`./install.sh` Build library in your local directory. It is assumed dependencies are available. -`./install.sh -i` Build library, then build and install rocALUTION package in `/opt/rocm/rocalution`. You will be prompted for sudo access. This will install for all users. -`./install.sh --host` Build library in your local directory without HIP support. It is assumed dependencies are available. -`./install.sh --mpi=` Build library in your local directory with HIP and MPI support. It is assumed dependencies are available. -========================== ==== - -Using `install.sh` script to build rocALUTION with dependencies and clients -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The client contains example code, unit tests and benchmarks. Common uses of `install.sh` to build them are listed in the table below. - -.. tabularcolumns:: - |\X{1}{6}|\X{5}{6}| - -=================== ==== -Command Description -=================== ==== -`./install.sh -h` Print help information. -`./install.sh -dc` Build dependencies, library and client in your local directory. The `-d` flag only needs to be used once. 
For subsequent invocations of `install.sh` it is not necessary to rebuild the dependencies. -`./install.sh -c` Build library and client in your local directory. It is assumed dependencies are available. -`./install.sh -idc` Build library, dependencies and client, then build and install rocALUTION package in `/opt/rocm/rocalution`. You will be prompted for sudo access. This will install for all users. -`./install.sh -ic` Build library and client, then build and install rocALUTION package in `opt/rocm/rocalution`. You will be prompted for sudo access. This will install for all users. -=================== ==== +The following table lists the common uses of ``install.sh`` to build dependencies and the library. Accelerator support via HIP and OpenMP is enabled by default, whereas MPI is disabled. + +============================ ==== +Command Description +============================ ==== +``./install.sh -h`` Prints help information. +``./install.sh -d`` Builds dependencies and library in your local directory. The ``-d`` flag only needs to be used once. For subsequent invocations of ``install.sh`` it is not necessary to rebuild the dependencies. +``./install.sh`` Builds library in your local directory, assuming the dependencies are available. +``./install.sh -i`` Builds library, then builds and installs rocALUTION package in ``/opt/rocm/rocalution``. You will be prompted for sudo access. The package is installed for all users. +``./install.sh --host`` Builds library in your local directory without HIP support, assuming the dependencies are available. +``./install.sh --mpi=`` Builds library in your local directory with HIP and MPI support, assuming the dependencies are available. +============================ ==== + +Using ``install.sh`` script to build rocALUTION with dependencies and clients ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The client contains example code, unit tests and benchmarks. Common uses of ``install.sh`` to build them are listed in the table below: + +===================== ==== +Command Description +===================== ==== +``./install.sh -h`` Prints help information. +``./install.sh -dc`` Builds dependencies, library and client in your local directory. The ``-d`` flag only needs to be used once. For subsequent invocations of ``install.sh`` it is not necessary to rebuild the dependencies. +``./install.sh -c`` Builds library and client in your local directory, assuming the dependencies are available. +``./install.sh -idc`` Builds library, dependencies and client, then builds and installs rocALUTION package in ``/opt/rocm/rocalution``. You will be prompted for sudo access. The package is installed for all users. +``./install.sh -ic`` Builds library and client, then builds and installs rocALUTION package in ``/opt/rocm/rocalution``. You will be prompted for sudo access. The package is installed for all users. +===================== ==== Using individual commands to build rocALUTION ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -CMake 3.5 or later is required in order to build rocALUTION without the use of `install.sh`. -rocALUTION can be built with cmake using the following commands: +CMake 3.5 or later is required to build rocALUTION without the use of ``install.sh``. + +rocALUTION can be built with ``cmake`` using the following commands: :: @@ -122,7 +125,7 @@ rocALUTION can be built with cmake using the following commands: # Install rocALUTION to /opt/rocm sudo make install -`GoogleTest `_ is required in order to build all rocALUTION clients.
+`GoogleTest `_ is required to build all rocALUTION clients. rocALUTION with dependencies and clients can be built using the following commands: @@ -148,36 +151,38 @@ rocALUTION with dependencies and clients can be built using the following comman # Install rocALUTION to /opt/rocm sudo make install -The compilation process produces a shared library file `librocalution.so` and `librocalution_hip.so` if HIP support is enabled. +The compilation process produces the shared library file ``librocalution.so`` and, if HIP support is enabled, ``librocalution_hip.so``. Ensure that the library objects can be found in your library path. -If you do not copy the library to a specific location you can add the path under Linux in the `LD_LIBRARY_PATH` variable. +If you don't copy the library to a specific location, you can add the path under Linux in the ``LD_LIBRARY_PATH`` variable. :: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH: Common build problems -^^^^^^^^^^^^^^^^^^^^^ -#. **Issue:** Could not find a package file provided by "ROCM" with any of the following names: - ROCMConfig.cmake - rocm-config.cmake +^^^^^^^^^^^^^^^^^^^^^^^ + +#. **Issue:** Could not find any of the following package files provided by "ROCM": + - ROCMConfig.cmake + - rocm-config.cmake **Solution:** Install `ROCm cmake modules `_ either from source or from `AMD ROCm repository `_. -#. **Issue:** Could not find a package file provided by "ROCSPARSE" with any of the following names: - ROCSPARSE.cmake - rocsparse-config.cmake +#. **Issue:** Could not find any of the following package files provided by "ROCSPARSE": + - ROCSPARSE.cmake + - rocsparse-config.cmake **Solution:** Install `rocSPARSE `_ either from source or from `AMD ROCm repository `_. -#. **Issue:** Could not find a package file provided by "ROCBLAS" with any of the following names: - ROCBLAS.cmake - rocblas-config.cmake +#. **Issue:** Could not find any of the following package files provided by "ROCBLAS": + - ROCBLAS.cmake + - rocblas-config.cmake - **Solution:** Install `rocBLAS `_ either from source or from `AMD ROCm repository `_. + **Solution:** Install `rocBLAS `_ either from source or from `AMD ROCm repository `_. -Simple Test +Simple test ^^^^^^^^^^^ + You can test the installation by running a CG solver on a sparse matrix. After successfully compiling the library, the CG solver example can be executed. diff --git a/docs/usermanual/multinode.rst b/docs/usermanual/multinode.rst index 67f0a644..c8341caf 100644 --- a/docs/usermanual/multinode.rst +++ b/docs/usermanual/multinode.rst @@ -1,12 +1,17 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _multi-node: + ********************** -Multi-node Computation +Multi-node computation ********************** -Introduction -============ -This chapter describes all base objects (matrices and vectors) for computation on multi-node (distributed memory) systems. +This document describes all the base objects (matrices and vectors) for computation on multi-node (distributed memory) systems. .. _multi-node1: + .. figure:: ../data/multi-node1.png :alt: multi-node system configuration :align: center @@ -43,18 +48,18 @@ To perform a sparse matrix-vector multiplication (SpMV), each process need to mul where :math:`I` stands for interior and :math:`G` stands for ghost. :math:`x_G` is a vector with three sections, coming from *P1*, *P2* and *P3*.
The whole ghost part of the global vector is used mainly for the SpMV product. It does not play any role in the computation of vector-vector operations. -Code Structure +Code structure ============== Each object contains two local sub-objects. The global matrix stores interior and ghost matrix by local objects. Similarily, the global vector stores its data by two local objects. In addition to the local data, the global objects have information about the global communication through the parallel manager. .. _global_objects: .. figure:: ../data/global_objects.png - :alt: global matrices and vectors + :alt: global matrices and vectors :align: center - Global matrices and vectors. + Global matrices and vectors. -Parallel Manager +Parallel manager ================ .. doxygenclass:: rocalution::ParallelManager @@ -81,13 +86,13 @@ To setup a parallel manager, the required information is: * Local size of the interior/ghost for each process * Communication pattern (what information need to be sent to whom) -Global Matrices and Vectors +Global matrices and vectors =========================== .. doxygenfunction:: rocalution::GlobalMatrix::GetInterior .. doxygenfunction:: rocalution::GlobalMatrix::GetGhost .. doxygenfunction:: rocalution::GlobalVector::GetInterior -The global matrices and vectors store their data via two local objects. For the global matrix, the interior can be access via the :cpp:func:`rocalution::GlobalMatrix::GetInterior` and :cpp:func:`rocalution::GlobalMatrix::GetGhost` functions, which point to two valid local matrices. Similarily, the global vector can be accessed by :cpp:func:`rocalution::GlobalVector::GetInterior`. +The global matrices and vectors store their data via two local objects. For the global matrix, the interior can be accessed via the :cpp:func:`rocalution::GlobalMatrix::GetInterior` and :cpp:func:`rocalution::GlobalMatrix::GetGhost` functions, which point to two valid local matrices. Similarly, the global vector can be accessed by :cpp:func:`rocalution::GlobalVector::GetInterior`. Asynchronous SpMV ----------------- @@ -101,7 +106,7 @@ The user can store and load all global structures from and to files. For a solve * the sparse matrix * and the vector -Reading/writing from/to files can be done fully in parallel without any communication. :numref:`4x4_mpi` visualizes data of a :math:`4 \times 4` grid example which is distributed among 4 MPI processes (organized in :math:`2 \times 2`). Each local matrix stores the local unknowns (with local indexing). :numref:`4x4_mpi_rank0` furthermore illustrates the data associated with *RANK0*. +Reading/writing from/to files can be done fully in parallel without any communication. :numref:`4x4_mpi` visualizes data of a :math:`4 \times 4` grid example which is distributed among 4 MPI processes (organized in :math:`2 \times 2`). Each local matrix stores the local unknowns (with local indexing). :numref:`4x4_mpi_rank0` furthermore illustrates the data associated with ``RANK0``. .. _4x4_mpi: .. figure:: ../data/4x4_mpi.png @@ -116,10 +121,11 @@ Reading/writing from/to files can be done fully in parallel without any communic :alt: 4x4 grid, distributed in 4 domains (2x2), showing rank0 :align: center - An example of 4 MPI processes and the data associated with *RANK0*. + An example of 4 MPI processes and the data associated with ``RANK0``.
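+The interior/ghost splitting used for the SpMV described at the beginning of this section can be made concrete with a short sketch. The following is a minimal, self-contained C++ example that uses plain CSR arrays rather than the rocALUTION API; the matrix and vector values are made up for illustration and assume a rank that owns two interior unknowns and has already received its ghost values from the neighboring processes.
+
+.. code-block:: cpp
+
+    #include <cstddef>
+    #include <iostream>
+    #include <vector>
+
+    // y += A * x for a matrix stored in CSR format
+    void csr_spmv(const std::vector<int>&    row_ptr,
+                  const std::vector<int>&    col_ind,
+                  const std::vector<double>& val,
+                  const std::vector<double>& x,
+                  std::vector<double>&       y)
+    {
+        for(std::size_t i = 0; i < y.size(); ++i)
+            for(int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
+                y[i] += val[j] * x[col_ind[j]];
+    }
+
+    int main()
+    {
+        // Interior matrix A_I (2x2) and interior vector x_I owned by this rank
+        std::vector<int>    AI_ptr = {0, 2, 4};
+        std::vector<int>    AI_col = {0, 1, 0, 1};
+        std::vector<double> AI_val = {4.0, -1.0, -1.0, 4.0};
+        std::vector<double> xI     = {1.0, 2.0};
+
+        // Ghost matrix A_G (2x2) acting on the ghost vector x_G received from the neighbors
+        std::vector<int>    AG_ptr = {0, 1, 2};
+        std::vector<int>    AG_col = {0, 1};
+        std::vector<double> AG_val = {-1.0, -1.0};
+        std::vector<double> xG     = {3.0, 4.0};
+
+        // y_I = A_I * x_I + A_G * x_G
+        std::vector<double> yI(2, 0.0);
+        csr_spmv(AI_ptr, AI_col, AI_val, xI, yI);
+        csr_spmv(AG_ptr, AG_col, AG_val, xG, yI);
+
+        std::cout << yI[0] << " " << yI[1] << std::endl; // prints "-1 3"
+        return 0;
+    }
+
+In rocALUTION itself, this splitting is handled internally by the global objects and their parallel manager; the sketch only illustrates why the ghost vector is needed for the multiplication but plays no role in vector-vector operations.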
-File Organization +File organization ----------------- + When the parallel manager, global matrix or global vector are writing to a file, the main file (passed as a file name to this function) will contain information for all files on all ranks. .. code-block:: RST @@ -147,42 +153,45 @@ When the parallel manager, global matrix or global vector are writing to a file, rhs.dat.rank.2 rhs.dat.rank.3 -Parallel Manager +Parallel manager ---------------- -The data for each rank can be split into receiving and sending information. For receiving data from neighboring processes, see :numref:`receiving`, *RANK0* need to know what type of data will be received and from whom. For sending data to neighboring processes, see :numref:`sending`, *RANK0* need to know where and what to send. + +The data for each rank can be split into receiving and sending information. For receiving data from neighboring processes, see :numref:`receiving`, where ``RANK0`` needs to know what type of data will be received and from whom. For sending data to neighboring processes, see :numref:`sending`, where ``RANK0`` needs to know where and what to send. .. _receiving: .. figure:: ../data/receiving.png :alt: receiving data example :align: center - An example of 4 MPI processes, *RANK0* receives data (the associated data is marked bold). + An example of 4 MPI processes, where ``RANK0`` receives data (the associated data is marked bold). -To receive data, *RANK0* requires: +To receive data, ``RANK0`` requires: -* Number of MPI ranks, which will send data to *RANK0* (NUMBER_OF_RECEIVERS - integer value). -* Which are the MPI ranks, sending the data (RECEIVERS_RANK - integer array). -* How will the received data (from each rank) be stored in the ghost vector (RECEIVERS_INDEX_OFFSET - integer array). In this example, the first 30 elements will be received from *P1* :math:`[0, 2)` and the second 30 from *P2* :math:`[2, 4)`. +* Number of MPI ranks that will send data to ``RANK0`` (``NUMBER_OF_RECEIVERS`` - integer value). +* The MPI ranks sending the data (``RECEIVERS_RANK`` - integer array). +* How the received data (from each rank) is stored in the ghost vector (``RECEIVERS_INDEX_OFFSET`` - integer array). In this example, the first 30 elements will be received from *P1* :math:`[0, 2)` and the second 30 from *P2* :math:`[2, 4)`. .. _sending: .. figure:: ../data/sending.png :alt: sending data example :align: center - An example of 4 MPI processes, *RANK0* sends data (the associated data is marked bold). + An example of 4 MPI processes, where ``RANK0`` sends data (the associated data is marked bold). -To send data, *RANK0* requires: +To send data, ``RANK0`` requires: -* Total size of the sending information (BOUNDARY_SIZE - integer value). -* Number of MPI ranks, which will receive data from *RANK0* (NUMBER_OF_SENDERS - integer value). -* Which are the MPI ranks, receiving the data (SENDERS_RANK - integer array). -* How will the sending data (from each rank) be stored in the sending buffer (SENDERS_INDEX_OFFSET - integer array). In this example, the first 30 elements will be sent to *P1* :math:`[0, 2)` and the second 30 to *P2* :math:`[2, 4)`. -* The elements, which need to be send (BOUNDARY_INDEX - integer array). In this example, the data which need to be send to *P1* and *P2* is the ghost layer, marked as ghost *P0*. The vertical stripe need to be send to *P1* and the horizontal stripe to *P2*. The numbering of local unknowns (in local indexing) for *P1* (the vertical stripes) are 1, 2 (size of 2) and stored in the BOUNDARY_INDEX.
After 2 elements, the elements for *P2* are stored, they are 2, 3 (2 elements). +* Total size of the sending information (``BOUNDARY_SIZE`` - integer value). +* Number of MPI ranks that will receive data from ``RANK0`` (``NUMBER_OF_SENDERS`` - integer value). +* The MPI ranks receiving the data (``SENDERS_RANK`` - integer array). +* How the data to be sent (from each rank) is stored in the sending buffer (``SENDERS_INDEX_OFFSET`` - integer array). In this example, the first 30 elements will be sent to *P1* :math:`[0, 2)` and the second 30 to *P2* :math:`[2, 4)`. +* The elements that need to be sent (``BOUNDARY_INDEX`` - integer array). In this example, the data that needs to be sent to *P1* and *P2* is the ghost layer, marked as ghost *P0*. The vertical stripe needs to be sent to *P1* and the horizontal stripe to *P2*. The numbering of local unknowns (in local indexing) for *P1* (the vertical stripes) is 1, 2 (size of 2), stored in the ``BOUNDARY_INDEX``. After 2 elements, the elements for *P2* are stored, they are 2, 3 (2 elements). -Matrices +Matrices -------- + +Each rank hosts two local matrices, interior and ghost matrix. They can be stored in separate files, one for each matrix. The file format could be Matrix Market (MTX) or binary. Vectors ------- + Each rank holds the local interior vector only. It is stored in a single file. The file could be ASCII or binary. diff --git a/docs/usermanual/remarks.rst b/docs/usermanual/remarks.rst index 70ee3c12..117d29f6 100644 --- a/docs/usermanual/remarks.rst +++ b/docs/usermanual/remarks.rst @@ -1,3 +1,9 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _remarks: + ******* Remarks ******* @@ -14,7 +20,7 @@ Performance * Not all matrix conversions are performed on the device, the platform will give you a warning if the object need to be moved. * If you are deploying the rocALUTION library into another software framework try to design your integration functions to avoid :cpp:func:`rocalution::init_rocalution` and :cpp:func:`rocalution::stop_rocalution` every time you call a solver in the library. * Be sure to compile the library with the correct optimization level (-O3). -* Check, if your solver is really performed on the accelerator by printing the matrix information (:cpp:func:`rocalution::BaseRocalution::Info`) just before calling the :cpp:func:`rocalution::Solver::Solve` function. +* Check if your solver is really performed on the accelerator by printing the matrix information (:cpp:func:`rocalution::BaseRocalution::Info`) just before calling the :cpp:func:`rocalution::Solver::Solve` function. * Check the configuration of the library for your hardware with :cpp:func:`rocalution::info_rocalution`. * Mixed-Precision defect correction technique is recommended for accelerators (e.g. GPUs) with partial or no double precision support. The stopping criteria for the inner solver has to be tuned well for good performance. diff --git a/docs/usermanual/singlenode.rst b/docs/usermanual/singlenode.rst index 0720bf9a..a78d68bb 100644 --- a/docs/usermanual/singlenode.rst +++ b/docs/usermanual/singlenode.rst @@ -1,12 +1,16 @@ +..
+.. meta::
+  :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains
+  :keywords: rocALUTION, ROCm, library, API, tool
+
+.. _single-node:
+
 ***********************
-Single-node Computation
+Single-node computation
 ***********************
 
-Introduction
-============
-In this chapter, all base objects (matrices, vectors and stencils) for computation on a single-node (shared-memory) system are described. A typical configuration is illustrated in :numref:`single-node`.
+In this document, all base objects (matrices, vectors, and stencils) for computation on a single-node (shared-memory) system are described. A typical configuration is illustrated in the figure below.
 
-.. _single-node:
+.. _single-node-figure:
 .. figure:: ../data/single-node.png
   :alt: single-node system configuration
   :align: center
@@ -19,13 +23,13 @@ The compute node contains none, one or more accelerators. The compute node could
 ValueType
 =========
 The value (data) type of the vectors and the matrices is defined as a template. The matrix can be of type float (32-bit), double (64-bit) and complex (64/128-bit). The vector can be float (32-bit), double (64-bit), complex (64/128-bit) and int (32/64-bit). The information about the precision of the data type is shown in the :cpp:func:`rocalution::BaseRocalution::Info` function.
 
-Complex Support
+Complex support
 ===============
-Currently, rocALUTION does not support complex computation.
+Currently, rocALUTION doesn't support complex computation.
 
-Allocation and Free
+Allocation and free
 ===================
 .. doxygenfunction:: rocalution::LocalVector::Allocate
 .. doxygenfunction:: rocalution::LocalVector::Clear
@@ -50,9 +54,10 @@ Allocation and Free
 
 .. _matrix_formats:
 
-Matrix Formats
+Matrix formats
 ==============
-Matrices, where most of the elements are equal to zero, are called sparse. In most practical applications, the number of non-zero entries is proportional to the size of the matrix (e.g. typically, if the matrix :math:`A \in \mathbb{R}^{N \times N}`, then the number of elements are of order :math:`O(N)`). To save memory, storing zero entries can be avoided by introducing a structure corresponding to the non-zero elements of the matrix. rocALUTION supports sparse CSR, MCSR, COO, ELL, DIA, HYB and dense matrices (DENSE).
+
+Matrices, where most of the elements are equal to zero, are called sparse. In most practical applications, the number of non-zero entries is proportional to the size of the matrix (e.g. typically, if the matrix :math:`A \in \mathbb{R}^{N \times N}`, the number of elements is of order :math:`O(N)`). To save memory, storing zero entries can be avoided by introducing a structure corresponding to the non-zero elements of the matrix. rocALUTION supports sparse CSR, MCSR, COO, ELL, DIA, HYB and dense matrices (DENSE).
 
 .. note:: The functionality of every matrix object is different and depends on the matrix format. The CSR format provides the highest support for various functions.
For a few operations, an internal conversion is performed, however, for many routines an error message is printed and the program is terminated. .. note:: In the current version, some of the conversions are performed on the host (disregarding the actual object allocation - host or accelerator). @@ -83,16 +88,17 @@ Matrices, where most of the elements are equal to zero, are called sparse. In mo COO storage format ------------------ -The most intuitive sparse format is the coordinate format (COO). It represents the non-zero elements of the matrix by their coordinates and requires two index arrays (one for row and one for column indexing) and the values array. A :math:`m \times n` matrix is represented by -=========== ================================================================== -m number of rows (integer). -n number of columns (integer). -nnz number of non-zero elements (integer). -coo_val array of ``nnz`` elements containing the data (floating point). -coo_row_ind array of ``nnz`` elements containing the row indices (integer). -coo_col_ind array of ``nnz`` elements containing the column indices (integer). -=========== ================================================================== +The most intuitive sparse format is the coordinate format (COO). It represents the non-zero elements of the matrix by their coordinates and requires two index arrays (one for row and one for column indexing) and the values array. A :math:`m \times n` matrix is represented by: + +================ ==================================================================== +``m`` Number of rows (integer). +``n`` Number of columns (integer). +``nnz`` Number of non-zero elements (integer). +``coo_val`` Array of ``nnz`` elements containing the data (floating point). +``coo_row_ind`` Array of ``nnz`` elements containing the row indices (integer). +``coo_col_ind`` Array of ``nnz`` elements containing the column indices (integer). +================ ==================================================================== .. note:: The COO matrix is expected to be sorted by row indices and column indices per row. Furthermore, each pair of indices should appear only once. @@ -118,17 +124,18 @@ where CSR storage format ------------------ + One of the most popular formats in many scientific codes is the compressed sparse row (CSR) format. In this format, instead of row indices, the row offsets to the beginning of each row are stored. Thus, each row elements can be accessed sequentially. However, this format does not allow sequential accessing of the column entries. -The CSR storage format represents a :math:`m \times n` matrix by +The CSR storage format represents a :math:`m \times n` matrix by: -=========== ========================================================================= -m number of rows (integer). -n number of columns (integer). -nnz number of non-zero elements (integer). -csr_val array of ``nnz`` elements containing the data (floating point). -csr_row_ptr array of ``m+1`` elements that point to the start of every row (integer). -csr_col_ind array of ``nnz`` elements containing the column indices (integer). -=========== ========================================================================= +=============== ========================================================================= +``m`` Number of rows (integer). +``n`` Number of columns (integer). +``nnz`` Number of non-zero elements (integer). +``csr_val`` Array of ``nnz`` elements containing the data (floating point). 
+``csr_row_ptr`` Array of ``m+1`` elements that point to the start of every row (integer). +``csr_col_ind`` Array of ``nnz`` elements containing the column indices (integer). +=============== ========================================================================= .. note:: The CSR matrix is expected to be sorted by column indices within each row. Furthermore, each pair of indices should appear only once. @@ -154,17 +161,17 @@ where BCSR storage format ------------------- -The Block Compressed Sparse Row (BCSR) storage format represents a :math:`(mb \cdot \text{bcsr_dim}) \times (nb \cdot \text{bcsr_dim})` matrix by - -============ ======================================================================================================================================== -mb number of block rows (integer) -nb number of block columns (integer) -nnzb number of non-zero blocks (integer) -bcsr_val array of ``nnzb * bcsr_dim * bcsr_dim`` elements containing the data (floating point). Data within each block is stored in column-major. -bcsr_row_ptr array of ``mb+1`` elements that point to the start of every block row (integer). -bcsr_col_ind array of ``nnzb`` elements containing the block column indices (integer). -bcsr_dim dimension of each block (integer). -============ ======================================================================================================================================== +The Block Compressed Sparse Row (BCSR) storage format represents a :math:`(mb \cdot \text{bcsr_dim}) \times (nb \cdot \text{bcsr_dim})` matrix by: + +================ ======================================================================================================================================== +``mb`` Number of block rows (integer) +``nb`` Number of block columns (integer) +``nnzb`` Number of non-zero blocks (integer) +``bcsr_val`` Array of ``nnzb * bcsr_dim * bcsr_dim`` elements containing the data (floating point). Data within each block is stored in column-major. +``bcsr_row_ptr`` Array of ``mb+1`` elements that point to the start of every block row (integer). +``bcsr_col_ind`` Array of ``nnzb`` elements containing the block column indices (integer). +``bcsr_dim`` Dimension of each block (integer). +================ ======================================================================================================================================== The BCSR matrix is expected to be sorted by column indices within each row. If :math:`m` or :math:`n` are not evenly divisible by the block dimension, then zeros are padded to the matrix, such that :math:`mb = (m + \text{bcsr_dim} - 1) / \text{bcsr_dim}` and :math:`nb = (n + \text{bcsr_dim} - 1) / \text{bcsr_dim}`. Consider the following :math:`4 \times 3` matrix and the corresponding BCSR structures, with :math:`\text{bcsr_dim} = 2, mb = 2, nb = 2` and :math:`\text{nnzb} = 4` using zero based indexing and column-major storage: @@ -220,16 +227,17 @@ with arrays representation ELL storage format ------------------ + The Ellpack-Itpack (ELL) storage format can be seen as a modification of the CSR format without row offset pointers. Instead, a fixed number of elements per row is stored. -It represents a :math:`m \times n` matrix by +It represents a :math:`m \times n` matrix by: -=========== ================================================================================ -m number of rows (integer). -n number of columns (integer). 
-ell_width maximum number of non-zero elements per row (integer) -ell_val array of ``m times ell_width`` elements containing the data (floating point). -ell_col_ind array of ``m times ell_width`` elements containing the column indices (integer). -=========== ================================================================================ +=============== ================================================================================ +``m`` Number of rows (integer). +``n`` Number of columns (integer). +``ell_width`` Maximum number of non-zero elements per row (integer) +``ell_val`` Array of ``m times ell_width`` elements containing the data (floating point). +``ell_col_ind`` Array of ``m times ell_width`` elements containing the column indices (integer). +=============== ================================================================================ .. note:: The ELL matrix is assumed to be stored in column-major format. Rows with less than ``ell_width`` non-zero elements are padded with zeros (``ell_val``) and :math:`-1` (``ell_col_ind``). @@ -256,16 +264,17 @@ where DIA storage format ------------------ + If all (or most) of the non-zero entries belong to a few diagonals of the matrix, they can be stored with the corresponding offsets. The values in DIA format are stored as array with size :math:`D \times N_D`, where :math:`D` is the number of diagonals in the matrix and :math:`N_D` is the number of elements in the main diagonal. Since not all values in this array are occupied, the not accessible entries are denoted with :math:`\ast`. They correspond to the offsets in the diagonal array (negative values represent offsets from the beginning of the array). -The DIA storage format represents a :math:`m \times n` matrix by +The DIA storage format represents a :math:`m \times n` matrix by: -========== ==== -m number of rows (integer) -n number of columns (integer) -ndiag number of occupied diagonals (integer) -dia_offset array of ``ndiag`` elements containing the offset with respect to the main diagonal (integer). -dia_val array of ``m times ndiag`` elements containing the values (floating point). -========== ==== +============== =============================================================================================== +``m`` Number of rows (integer) +``n`` Number of columns (integer) +``ndiag`` Number of occupied diagonals (integer) +``dia_offset`` Array of ``ndiag`` elements containing the offset with respect to the main diagonal (integer). +``dia_val`` Array of ``m times ndiag`` elements containing the values (floating point). +============== =============================================================================================== Consider the following :math:`5 \times 5` matrix and the corresponding DIA structures, with :math:`m = 5, n = 5` and :math:`\text{ndiag} = 4`: @@ -292,19 +301,20 @@ where HYB storage format ------------------ -The DIA and ELL formats cannot represent efficiently completely unstructured sparse matrices. To keep the memory footprint low, DIA requires the elements to belong to a few diagonals and ELL needs a fixed number of elements per row. For many applications this is a too strong restriction. A solution to this issue is to represent the more regular part of the matrix in such a format and the remaining part in COO format. The HYB format is a mixture between ELL and COO, where the maximum elements per row for the ELL part is computed by `nnz/m`. 
It represents a :math:`m \times n` matrix by
-
-=========== =========================================================================================
-m           number of rows (integer).
-n           number of columns (integer).
-nnz         number of non-zero elements of the COO part (integer)
-ell_width   maximum number of non-zero elements per row of the ELL part (integer)
-ell_val     array of ``m times ell_width`` elements containing the ELL part data (floating point).
-ell_col_ind array of ``m times ell_width`` elements containing the ELL part column indices (integer).
-coo_val     array of ``nnz`` elements containing the COO part data (floating point).
-coo_row_ind array of ``nnz`` elements containing the COO part row indices (integer).
-coo_col_ind array of ``nnz`` elements containing the COO part column indices (integer).
-=========== =========================================================================================
+
+The DIA and ELL formats cannot efficiently represent completely unstructured sparse matrices. To keep the memory footprint low, DIA requires the elements to belong to a few diagonals and ELL needs a fixed number of elements per row. For many applications, this is too strong a restriction. A solution to this issue is to represent the more regular part of the matrix in such a format and the remaining part in COO format. The HYB format is a mixture between ELL and COO, where the maximum number of elements per row for the ELL part is computed by `nnz/m`. It represents a :math:`m \times n` matrix by:
+
+=============== =========================================================================================
+``m``           Number of rows (integer).
+``n``           Number of columns (integer).
+``nnz``         Number of non-zero elements of the COO part (integer)
+``ell_width``   Maximum number of non-zero elements per row of the ELL part (integer)
+``ell_val``     Array of ``m times ell_width`` elements containing the ELL part data (floating point).
+``ell_col_ind`` Array of ``m times ell_width`` elements containing the ELL part column indices (integer).
+``coo_val``     Array of ``nnz`` elements containing the COO part data (floating point).
+``coo_row_ind`` Array of ``nnz`` elements containing the COO part row indices (integer).
+``coo_col_ind`` Array of ``nnz`` elements containing the COO part column indices (integer).
+=============== =========================================================================================
 
 Memory Usage
 ------------
@@ -336,11 +346,11 @@ File I/O
 
 Access
 ======
-.. doxygenfunction:: rocalution::LocalVector::operator[](int)
+.. doxygenfunction:: rocalution::LocalVector::&operator[](int)
    :outline:
-.. doxygenfunction:: rocalution::LocalVector::operator[](int) const
+.. doxygenfunction:: rocalution::LocalVector::&operator[](int) const
 
 .. note:: Accessing elements via the *[]* operators is slow. Use this for debugging purposes only. There is no direct access to the elements of matrices due to the sparsity structure. Matrices can be imported by a copy function. For CSR matrices, this is :cpp:func:`rocalution::LocalMatrix::CopyFromCSR` and :cpp:func:`rocalution::LocalMatrix::CopyToCSR`.
 
 ..
code-block:: cpp @@ -359,14 +369,14 @@ Access mat.AllocateCSR("my_matrix", 345, 100, 100); mat.CopyFromCSR(csr_row_ptr, csr_col, csr_val); -Raw Access to the Data +Raw access to the data ====================== .. _SetDataPtr: SetDataPtr ---------- -For vector and matrix objects, direct access to the raw data can be obtained via pointers. Already allocated data can be set with *SetDataPtr*. Setting data pointers will leave the original pointers empty. +For vector and matrix objects, direct access to the raw data can be obtained via pointers. Already allocated data can be set with ``SetDataPtr``. Setting data pointers leaves the original pointers empty. .. doxygenfunction:: rocalution::LocalVector::SetDataPtr .. doxygenfunction:: rocalution::LocalMatrix::SetDataPtrCOO @@ -385,7 +395,8 @@ For vector and matrix objects, direct access to the raw data can be obtained via LeaveDataPtr ------------ -With *LeaveDataPtr*, the raw data from the object can be obtained. This will leave the object empty. + +With ``LeaveDataPtr``, the raw data from the object can be obtained. This leaves the object empty. .. doxygenfunction:: rocalution::LocalVector::LeaveDataPtr .. doxygenfunction:: rocalution::LocalMatrix::LeaveDataPtrCOO @@ -405,23 +416,26 @@ With *LeaveDataPtr*, the raw data from the object can be obtained. This will lea .. note:: Never rely on old pointers, hidden object movement to and from the accelerator will make them invalid. .. note:: Whenever you pass or obtain pointers to/from a rocALUTION object, you need to use the same memory allocation/free functions. Please check the source code for that (for host *src/utils/allocate_free.cpp* and for HIP *src/base/hip/hip_allocate_free.cpp*) -Copy CSR Matrix Host Data +Copy CSR matrix host data ========================= + .. doxygenfunction:: rocalution::LocalMatrix::CopyFromHostCSR -Copy Data +Copy data ========= -The user can copy data to and from a local vector by using *CopyFromData()* *CopyToData()*. + +You can copy data to and from a local vector by using ``CopyFromData()`` and ``CopyToData()``. .. doxygenfunction:: rocalution::LocalVector::CopyFromData .. doxygenfunction:: rocalution::LocalVector::CopyToData -Object Info +Object info =========== .. doxygenfunction:: rocalution::BaseRocalution::Info Copy ==== + All matrix and vector objects provide a *CopyFrom()* function. The destination object should have the same size or be empty. In the latter case, the object is allocated at the source platform. .. doxygenfunction:: rocalution::LocalVector::CopyFrom(const LocalVector&) @@ -429,38 +443,45 @@ All matrix and vector objects provide a *CopyFrom()* function. The destination o .. note:: For vectors, the user can specify source and destination offsets and thus copy only a part of the whole vector into another vector. -.. doxygenfunction:: rocalution::LocalVector::CopyFrom(const LocalVector&, int, int, int) +.. doxygenfunction:: rocalution::LocalVector::CopyFrom(const LocalVector&, int64_t, int64_t, int64_t) Clone ===== -The copy operators allow you to copy the values of the object to another object, without changing the backend specification of the object. In many algorithms, you might need auxiliary vectors or matrices. These objects can be cloned with CloneFrom(). + +The copy operators allow you to copy the values of the object to another object, without changing the backend specification of the object. In many algorithms, you might need auxiliary vectors or metrices. These objects can be cloned with ``CloneFrom()``. CloneFrom --------- + .. 
doxygenfunction:: rocalution::LocalVector::CloneFrom .. doxygenfunction:: rocalution::LocalMatrix::CloneFrom CloneBackend ------------ + .. doxygenfunction:: rocalution::BaseRocalution::CloneBackend(const BaseRocalution&) Check ===== + .. doxygenfunction:: rocalution::LocalVector::Check .. doxygenfunction:: rocalution::LocalMatrix::Check -Checks, if the object contains valid data. For vectors, the function checks if the values are not infinity and not NaN (not a number). For matrices, this function checks the values and if the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO matrices have to be sorted, etc.). +Checks if the object contains valid data. For vectors, the function checks if the values are not infinity and not NaN (not a number). For metrices, this function checks the values and if the structure of the matrix is correct (e.g. indices cannot be negative, CSR and COO metrices have to be sorted, etc.). Sort ==== + .. doxygenfunction:: rocalution::LocalMatrix::Sort Keying ====== + .. doxygenfunction:: rocalution::LocalMatrix::Key -Graph Analyzers +Graph analyzers =============== + The following functions are available for analyzing the connectivity in graph of the underlying sparse matrix. * (R)CMK Ordering @@ -471,27 +492,33 @@ The following functions are available for analyzing the connectivity in graph of All graph analyzing functions return a permutation vector (integer type), which is supposed to be used with the :cpp:func:`rocalution::LocalMatrix::Permute` and :cpp:func:`rocalution::LocalMatrix::PermuteBackward` functions in the matrix and vector classes. -Cuthill-McKee Ordering +Cuthill-McKee ordering ---------------------- + .. doxygenfunction:: rocalution::LocalMatrix::CMK .. doxygenfunction:: rocalution::LocalMatrix::RCMK -Maximal Independent Set +Maximal independent set ----------------------- + .. doxygenfunction:: rocalution::LocalMatrix::MaximalIndependentSet -Multi-Coloring +Multi-coloring -------------- + .. doxygenfunction:: rocalution::LocalMatrix::MultiColoring -Zero Block Permutation +Zero block permutation ---------------------- + .. doxygenfunction:: rocalution::LocalMatrix::ZeroBlockPermutation -Connectivity Ordering +Connectivity ordering --------------------- + .. doxygenfunction:: rocalution::LocalMatrix::ConnectivityOrder -Basic Linear Algebra Operations +Basic linear algebra operations =============================== -For a full list of functions and routines involving operators and vectors, see the API specifications. + +For a full list of functions and routines involving operators and vectors, see the :ref:`api`. diff --git a/docs/usermanual/targets.rst b/docs/usermanual/targets.rst index ef1c111d..824cd681 100644 --- a/docs/usermanual/targets.rst +++ b/docs/usermanual/targets.rst @@ -1,16 +1,22 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _supported-targets: + ################# -Supported Targets +Supported targets ################# -Currently, rocALUTION is supported under the following operating systems +Supported operating systems: - Ubuntu 16.04, Ubuntu 18.04 - CentOS 7 - SLES 15 -To compile and run rocALUTION with HIP support, `AMD ROCm Platform `_ 2.9 or newer is required. +To compile and run rocALUTION with HIP support, `AMD ROCm Platform `_ 2.9 or later is required. 
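+
+To confirm at run time which backend and device rocALUTION has detected on a given target, you can print the library configuration before running any solver. The following is a minimal sketch, not part of the library documentation: it assumes only the ``rocalution.hpp`` header and the :cpp:func:`rocalution::init_rocalution`, :cpp:func:`rocalution::info_rocalution` and :cpp:func:`rocalution::stop_rocalution` calls referenced in the remarks section.
+
+.. code-block:: cpp
+
+    // Adjust the include path to match your installation (assumed layout)
+    #include <rocalution/rocalution.hpp>
+
+    using namespace rocalution;
+
+    int main(void)
+    {
+        // Set up the backends (host, and HIP if a device is available)
+        init_rocalution();
+
+        // Print the detected hardware and backend configuration
+        info_rocalution();
+
+        stop_rocalution();
+
+        return 0;
+    }
+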
-The following HIP devices are currently supported +Supported HIP devices: - gfx803 (e.g. Fiji) - gfx900 (e.g. Vega10, MI25) diff --git a/docs/usermanual/usermanual.rst b/docs/usermanual/usermanual.rst deleted file mode 100644 index 4ee52c50..00000000 --- a/docs/usermanual/usermanual.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. _user_manual: - -########### -User Manual -########### - -.. toctree:: - :maxdepth: 3 - :caption: Contents: - - intro - install - basics - singlenode - multinode - solvers - precond - backend - remarks - targets diff --git a/docs/usermanual/windows-installation.rst b/docs/usermanual/windows-installation.rst new file mode 100644 index 00000000..0713749b --- /dev/null +++ b/docs/usermanual/windows-installation.rst @@ -0,0 +1,167 @@ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool + +.. _windows-installation: + +===================================== +Installation on Windows +===================================== + +This document provides information required to install and configure rocALUTION on Windows. + +------------- +Prerequisites +------------- + +- An AMD HIP SDK-enabled platform. For more information, refer to the `ROCm documentation `_. +- rocALUTION is supported on the same Windows versions and toolchains that are supported by the HIP SDK. + +.. note:: + + As the AMD HIP SDK is under continuous development, the information updated regarding the SDK's internal contents may overrule the statements in this document on installing and building on Windows. + +---------------------------- +Installing prebuilt packages +---------------------------- + +rocALUTION can be installed on Windows 11 or Windows 10 using the AMD HIP SDK installer. + +The simplest way to use rocALUTION in your code is to use ``CMake`` that requires you to add the SDK installation location to your +`DCMAKE_PREFIX_PATH`. Note that you need to use quotes as the path contains a space, e.g., + +:: + + -DCMAKE_PREFIX_PATH="C:\Program Files\AMD\ROCm\5.5" + + +After CMake configuration, in your ``CMakeLists.txt`` use: + +:: + + find_package(rocalution) + + target_link_libraries( your_exe PRIVATE roc::rocalution ) + +Once rocALUTION is installed, you can find ``rocalution.hpp`` in the HIP SDK ``\\include\\rocalution`` +directory. Use only the installed file in the user application if needed. +You must include ``rocalution.hpp`` header file in the user code to make calls +into rocALUTION, so that the rocALUTION import library and dynamic link library become the respective link-time and run-time +dependencies for the user application. + +---------------------------------- +Building and installing rocALUTION +---------------------------------- + +Building from source is not necessary, as rocALUTION can be used after installing the pre-built packages as described above. +If desired, you can follow the instructions below to build rocALUTION from source. 
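+
+Whether you use the prebuilt package or a source build, a small smoke test is a quick way to verify that the header, import library, and DLL are all found. The snippet below is only a sketch: it assumes the ``rocalution.hpp`` header layout described above and the ``LocalVector`` routines listed in the single-node documentation.
+
+.. code-block:: cpp
+
+    // Adjust the include path to match your installation (assumed layout)
+    #include <rocalution/rocalution.hpp>
+
+    using namespace rocalution;
+
+    int main(void)
+    {
+        init_rocalution();
+
+        // Allocate a small vector on the host, zero it and print its description
+        LocalVector<double> x;
+        x.Allocate("x", 100);
+        x.Zeros();
+        x.Info();
+
+        // If a HIP device is available, this moves the data to the accelerator;
+        // on a host-only configuration the call is ignored
+        x.MoveToAccelerator();
+
+        x.Clear();
+        stop_rocalution();
+
+        return 0;
+    }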
+ +Requirements +^^^^^^^^^^^^ +- `git `_ +- `CMake `_ 3.5 or later +- `AMD ROCm `_ 2.9 or later (optional, for HIP support) +- `rocSPARSE `_ (optional, for HIP support) +- `rocBLAS `_ (optional, for HIP support) +- `rocPRIM `_ (optional, for HIP support) +- `OpenMP `_ (optional, for OpenMP support) +- `MPI `_ (optional, for multi-node / multi-GPU support) +- `googletest `_ (optional, for clients) + +Download rocALUTION +^^^^^^^^^^^^^^^^^^^ + +The rocALUTION source code, which is the same as for the ROCm linux distributions, is available at the `rocALUTION github page `_. +The version of the ROCm HIP SDK may be shown in the path of default installation, but +you can run the HIP SDK compiler to report the version from the ``bin/`` folder using: + +:: + + hipcc --version + +The HIP version has major, minor, and patch fields, possibly followed by a build-specific identifier. For example, a HIP version 5.4.22880-135e1ab4 corresponds to major = 5, minor = 4, patch = 22880, and build identifier 135e1ab4. +There are GitHub branches at the rocALUTION site with names ``release/rocm-rel-major.minor`` where major and minor are the same as in the HIP version. +To download rocALUTION, use: + +:: + + git clone -b release/rocm-rel-x.y https://github.com/ROCmSoftwarePlatform/rocALUTION.git + cd rocALUTION + +Replace ``x.y`` in the above command with the version of HIP SDK installed on your machine. For example, if you have HIP 5.5 installed, then use ``-b release/rocm-rel-5.5``. +You can add the SDK tools to your path using: + +:: + + %HIP_PATH%\bin + +Build +^^^^^^^^ + +Below are the steps required to build using the ``rmake.py`` script. The user can build either of the following: + +* library + +* library and client + +You only need (library) if you call rocALUTION from your code and want to build the library alone. +The client contains testing and benchmarking tools. ``rmake.py`` prints the full ``cmake`` command being used to configure rocALUTION based on your ``rmake`` command-line options. +This full ``cmake`` command can be used in your own build scripts if you want to bypass the Python helper script for a fixed set of build options. + +Build library +^^^^^^^^^^^^^^ + +Common uses of ``rmake.py`` to build (library) are listed below: + +.. tabularcolumns:: + |\X{1}{4}|\X{3}{4}| + ++--------------------+-----------------------------+ +| Command | Description | ++====================+=============================+ +| ``./rmake.py -h`` | Help information. | ++--------------------+-----------------------------+ +| ``./rmake.py`` | Builds library. | ++--------------------+-----------------------------+ +| ``./rmake.py -i`` | Builds library, then | +| | builds and installs | +| | rocALUTION package. 
| +| | If you want to keep | +| | rocALUTION in your local | +| | tree, don't use ``-i`` flag.| ++--------------------+-----------------------------+ + +Build library and client +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Some client executables (.exe) are listed below: + +====================== ================================================== +Executable name Description +====================== ================================================== +``rocalution-test`` Runs Google Tests to test the library +``rocalution-bench`` Executable to benchmark or test functions +``./cg lap_25.mtx`` Executes conjugate gradient example + (must download ``mtx`` matrix file you wish to use) +====================== ================================================== + +Common uses of ``rmake.py`` to build (library and client) are listed below: + +.. tabularcolumns:: + |\X{1}{4}|\X{3}{4}| + ++------------------------+----------------------------------+ +| Command | Description | ++========================+==================================+ +| ``./rmake.py -h`` | Help information. | ++------------------------+----------------------------------+ +| ``./rmake.py -c`` | Builds library and client | +| | in your local directory. | ++------------------------+----------------------------------+ +| ``./rmake.py -ic`` | Builds and installs | +| | rocALUTION package, and | +| | builds the client. | +| | If you want to keep | +| | rocALUTION in your local | +| | directory, don't use ``-i`` flag.| ++------------------------+----------------------------------+ diff --git a/docs/usermanual/intro.rst b/docs/what-is-rocalution.rst similarity index 59% rename from docs/usermanual/intro.rst rename to docs/what-is-rocalution.rst index 564a725b..06856790 100644 --- a/docs/usermanual/intro.rst +++ b/docs/what-is-rocalution.rst @@ -1,11 +1,17 @@ -Introduction -============ +.. meta:: + :description: A sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains + :keywords: rocALUTION, ROCm, library, API, tool -Overview --------- -rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism, targeting modern processors and accelerators including multi/many-core CPU and GPU platforms. The main goal of this package is to provide a portable library for iterative sparse methods on state of the art hardware. rocALUTION can be seen as middle-ware between different parallel backends and application specific packages. +.. _what-is-rocalution: -The major features and characteristics of the library are +What is rocALUTION? +==================== + +rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism on top of the AMD ROCm runtime and toolchains, targeting modern processors and accelerators including multi and many-core CPU and GPU platforms. The main goal of this package is to provide a portable library for iterative sparse methods on state of the art hardware. +rocALUTION can be seen as the middleware between different parallel backends and application-specific packages. +Based on C++ and HIP, it provides a portable, generic and flexible design that allows seamless integration with other scientific software packages. 
+ +The major features and characteristics of the rocALUTION library are: * Various backends * Host - fallback backend, designed for CPUs @@ -13,9 +19,9 @@ The major features and characteristics of the library are * OpenMP - designed for multi-core CPUs * MPI - designed for multi-node and multi-GPU configurations * Easy to use - The syntax and structure of the library provide easy learning curves. With the help of the examples, anyone can try out the library - no knowledge in HIP, OpenMP or MPI programming required. + The syntax and structure of the library provide easy learning curves. With the help of examples, anyone can try out the library. No knowledge in HIP, OpenMP, or MPI programming is required. * No special hardware requirements - There are no hardware requirements to install and run rocALUTION. If a GPU device and HIP is available, the library will use them. + There are no hardware requirements to install and run rocALUTION. All you need is a GPU device and HIP. * Variety of iterative solvers * Fixed-Point iteration - Jacobi, Gauss-Seidel, Symmetric-Gauss Seidel, SOR and SSOR * Krylov subspace methods - CR, CG, BiCGStab, BiCGStab(l), GMRES, IDR, QMRCGSTAB, Flexible CG/GMRES @@ -38,20 +44,3 @@ The major features and characteristics of the library are Compressed Sparse Row (CSR), Modified Compressed Sparse Row (MCSR), Dense (DENSE), Coordinate (COO), ELL, Diagonal (DIA), Hybrid format of ELL and COO (HYB). The code is open-source under MIT license, see :ref:`rocalution_license` and hosted on the `GitHub rocALUTION page `_. - -.. _rocalution_license: - -License -------- - -rocALUTION is distributed as open-source under the following license: - -MIT License - -Copyright (C) 2018 Advanced Micro Devices, Inc. All rights reserved. - -Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.