Feature/mrhs solvers #1489

Merged: 114 commits merged into develop on Oct 10, 2024

Conversation

@maddyscientist (Member) commented Sep 5, 2024

This PR is a biggie:

  • Adds MRHS solvers to QUDA (at present CG, MR, SD, GCR, and CA-GCR are all implemented)
    • Reliable updates performed using 0th RHS
    • Do not flag convergence until all RHS are converged
  • Adds MRHS support to multigrid
    • Batched null space finding implemented, exposed with new parameter QudaMultigridParam::n_vec_batch
    • MRHS supported in actual solves as well
  • Exposes MRHS solvers through the invertMultiSrcQuda interface (see the sketch after this list)
    • This is compatible with the split-grid interface; batching is used across the number of sources per sub-grid
  • Explicit breaking of the interface
    • QudaInvertParam::true_res and QudaInvertParam::true_res_hq are now arrays
  • Batched deflation implemented
    • Batch eigenvalue deflation confirmed working using staggered fermions
    • Batch singular-value deflation confirmed working as a coarse grid deflator
  • Eigenvalue computation is now batched, to take advantage of MRHS
  • All Dirac::prepare and Dirac::reconstruct functions are now MRHS optimized
  • Chronological solver is now MRHS optimized
  • Cast from cvector<T> to T is now explicit instead of implicit
    • This makes it much easier to catch bugs when updating code to be MRHS
  • Tensor-core 3xFP16 DslashCoarse is now robust to underflow
  • Miscellaneous cleanup and additions to aid all of the above
  • Fixes some earlier bugs introduced in prior MRHS PRs
  • Adds MMA instantiations for 32 -> 64 coarsening
  • Since QUDA is now threaded, updates to using MPI_THREAD_FUNNELED
  • Augmentations to the power monitoring
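
As a concrete illustration of the interface change, here is a minimal sketch (not taken from the PR) of driving a batched solve through invertMultiSrcQuda and reading back the per-RHS residuals. The three-argument form of invertMultiSrcQuda and the use of QudaInvertParam::num_src to set the batch width are assumptions to be checked against quda.h; the per-RHS true_res / true_res_hq arrays are the breaking change listed above.

  #include <quda.h>
  #include <vector>
  #include <cstdio>

  // Hedged sketch: solve n_src right-hand sides in one batched call and read back
  // the per-RHS true residuals (now arrays). The invertMultiSrcQuda argument list
  // and the num_src field are assumed here and should be checked against quda.h.
  void solve_batched(std::vector<void *> &x, std::vector<void *> &b, QudaInvertParam &inv_param)
  {
    const int n_src = static_cast<int>(b.size());
    inv_param.num_src = n_src; // assumed to set the number of sources in the batch

    // a single call dispatches the MRHS solver across all sources
    invertMultiSrcQuda(x.data(), b.data(), &inv_param);

    // true_res and true_res_hq are per-RHS arrays after this PR
    for (int i = 0; i < n_src; i++)
      printf("rhs %d: true_res = %e, true_res_hq = %e\n", i, inv_param.true_res[i], inv_param.true_res_hq[i]);
  }

On the multigrid side, batched null-space generation is enabled analogously by setting the new QudaMultigridParam::n_vec_batch parameter to the desired batch width.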

Things left to do

  • Add real MRHS support to all solvers (or at least to a few outstanding ones: CGNE, CGNR, BiCGStab, etc.)
  • Verify NVSHMEM MRHS operators are all working as expected
  • Update QUDA version number due to breakage of interface (QUDA 2.0?)
  • Fix reporting of true residual per RHS when running split grid

@maddyscientist (Member, Author) commented Oct 3, 2024

This PR is now functionally complete, and all tests are passing. This is ready for final review (@weinbe2 @hummingtree @mathiaswagner @bjoo).

@@ -371,10 +370,12 @@ namespace quda {

if (!param.is_preconditioner) { // do not do the below if we this is an inner solver
Review comment (Contributor):
In the spirit of typo fixing, if we this is -> if this is

@weinbe2 (Contributor) commented Oct 4, 2024

I have tested the batch CG solver with a modified version of MILC that properly utilizes the multisource MILC interface function. This is available here: https://github.com/lattice/milc_qcd/tree/feature/quda-block-solver-interface ; current commit is lattice/milc_qcd@f0404fe . This PR works perfectly fine with the current develop version of MILC.

I will note that this has only been tested with vanilla CG. I have not yet plumbed in multi-rhs support for the MG solver; I consider that within the scope of a second QUDA PR.

@weinbe2 (Contributor) commented Oct 4, 2024

When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag --mg-eig-evals-batch-size 2 [###]. I uncovered this with coarsest-level deflation for staggered operators, coarse Nc = 64 or 96, on sm_80. I hit the following error when trying to converge 16 eigenvalues:

[...]
MG level 2 (GPU): RitzValue[0015]: (+2.6378149309623524e-03, +0.0000000000000000e+00) residual 1.4461873963747640e-05
MG level 2 (GPU): ERROR: nVec = 8 not instantiated
 (rank 0, host ipp1-1780.nvidia.com, block_transpose.cu:116 in void quda::launch_span_nVec(v_t&, quda::cvector_ref<O>&, quda::IntList<nVec, N ...>) [with v_t = quda::ColorSpinorField; b_t = const quda::ColorSpinorField; vFloat = float; bFloat = float; int nSpin = 2; int nColor = 64; int nVec = 16; int ...N = {}; quda::cvector_ref<O> = const quda::vector_ref<const quda::ColorSpinorField>]())
MG level 2 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=8192,aux=zero,color_spinor_field.cpp,406)

I compiled with MG MRHS support for 16 and 32; I can't think of anywhere that I've imposed a multiple of 8. The solution was explicitly setting --mg-eig-evals-batch-size 2 16 (where 2 is the lowest level).

My (relatively reduced) command is:

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.1 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
  --dim 8 8 8 8 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
  --mg-setup-tol 1 1e-5 --mg-setup-inv 1 cgnr \
  --nsrc 1 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 ca-gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 --mg-coarse-solver-ca-basis-size 2 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose \
  --mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true \
  --mg-nvec 2 16 --mg-eig-n-ev 2 16 --mg-eig-n-kr 2 128 --mg-eig-tol 2 1e-1 \
  --mg-eig-use-poly-acc 2 false --mg-eig-poly-deg 2 100 --mg-eig-amin 2 1e-1 \
  --mg-eig-max-restarts 2 1000

Neither toggling --mg-setup-use-mma 2 false nor --mg-dslash-use-mma 2 false works around this.

I can't quite think of a good way to address this (yet), but I'm also not clear on the details in the weeds. Maybe you know exactly where the fix is @maddyscientist ?

@maddyscientist (Member, Author) commented Oct 4, 2024

(Quoting @weinbe2's report above about the --mg-eig-evals-batch-size default and the "nVec = 8 not instantiated" error.)

Ok, I understand this issue. There are two things at play here:

  • For whatever reason --mg-dslash-use-mma i acts on the i + 1 level, so you should set --mg-dslash-use-mma 1 false, somewhat counter-intuitively. This was likely an oversight from when the MMA dslash was added. I can fix this.
  • If the evec batch size isn't set at the command line, it will use a default value of 8, which is what you've found. Perhaps 16 would be a better value for this, since that's the default MMA MRHS size in CMake?

Perhaps it would also be a good idea to fall back to the non-MMA dslash if the requested size isn't available? That would make things more bulletproof, perhaps with a warning on the first call?
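
For illustration only, a minimal sketch of the fallback-with-a-one-time-warning pattern being suggested here; this is not QUDA code, and mma_nvec_available, launch_mma_dslash, and launch_generic_dslash are hypothetical stand-ins.

  #include <cstdio>

  // Hypothetical stand-ins for the compiled-in MMA widths and the two dslash paths
  bool mma_nvec_available(int n_rhs) { return n_rhs == 16 || n_rhs == 32; }
  void launch_mma_dslash(int) { /* tensor-core path */ }
  void launch_generic_dslash(int) { /* non-MMA path */ }

  // Fall back to the generic dslash when the requested MRHS width has no MMA
  // instantiation, warning only on the first such call.
  void apply_coarse_dslash(int n_rhs)
  {
    if (mma_nvec_available(n_rhs)) {
      launch_mma_dslash(n_rhs);
      return;
    }
    static bool warned = false;
    if (!warned) {
      printf("Warning: no MMA instantiation for n_rhs = %d, falling back to generic dslash\n", n_rhs);
      warned = true;
    }
    launch_generic_dslash(n_rhs);
  }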

@weinbe2 (Contributor) left a comment
This has passed my visual review and my tests with MILC. Awesome PR @maddyscientist !

@bjoo (Member) left a comment
I had a very cursory look. I haven't had a chance to test the very latest with Chroma, but as of one or two commits back I think we fixed all the Chroma issues. So I am happy to approve. This is a great change.

@maddyscientist maddyscientist merged commit eead44e into develop Oct 10, 2024
14 checks passed
@maddyscientist maddyscientist deleted the feature/mrhs-solvers branch October 10, 2024 08:42
SaltyChiang added a commit to CLQCD/PyQUDA that referenced this pull request Oct 12, 2024