Feature/mrhs solvers #1489

Merged: 114 commits merged into develop on Oct 10, 2024

Conversation

@maddyscientist (Member) commented Sep 5, 2024

This PR is a biggie:

  • Adds MRHS solvers to QUDA (at present CG, MR, SD, GCR, and CA-GCR are all implemented)
    • Reliable updates performed using 0th RHS
    • Do not flag convergence until all RHS are converged
  • Adds MRHS support to multigrid
    • Batched null space finding implemented, exposed with new parameter QudaMultigridParam::n_vec_batch
    • MRHS supported in actual solves as well
  • Exposes MRHS solvers through the invertMultiSrcQuda interface (see the sketch after this list)
    • This is compatible with the split-grid interface; batching is used across the number of sources per sub-grid
  • Explicit breaking of the interface
    • QudaInvertParam::true_res and QudaInvertParam::true_res_hq are now arrays
  • Batched deflation implemented
    • Batch eigenvalue deflation confirmed working using staggered fermions
    • Batch singular-value deflation confirmed working as a coarse grid deflator
  • Eigenvalue computation is now batched, to take advantage of MRHS
  • All Dirac::prepare and Dirac::reconstruct functions are now MRHS optimized
  • Chronological solver is now MRHS optimized
  • Cast from cvector<T> to T is now explicit instead of implicit
    • This makes it much easier to catch bugs when updating code to be MRHS
  • Tensor-core 3xFP16 DslashCoarse is now robust to underflow
  • Miscellaneous cleanup and additions to aid all of the above
  • Fixes some earlier bugs introduced in prior MRHS PRs
  • Adds MMA instantiations for 32 -> 64 coarsening
  • Since QUDA is now threaded, updates to using MPI_THREAD_FUNNELED
  • Augmentations to the power monitoring
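
As a concrete illustration of the interface change, here is a minimal sketch (not taken from the PR) of driving a batched solve through invertMultiSrcQuda and reading back the per-RHS residuals. The three-argument form of invertMultiSrcQuda and the use of QudaInvertParam::num_src to set the batch width are assumptions to be checked against quda.h; the per-RHS true_res / true_res_hq arrays are the breaking change listed above.

  #include <quda.h>
  #include <vector>
  #include <cstdio>

  // Hedged sketch: solve n_src right-hand sides in one batched call and read back
  // the per-RHS true residuals (now arrays). The invertMultiSrcQuda argument list
  // and the num_src field are assumed here and should be checked against quda.h.
  void solve_batched(std::vector<void *> &x, std::vector<void *> &b, QudaInvertParam &inv_param)
  {
    const int n_src = static_cast<int>(b.size());
    inv_param.num_src = n_src; // assumed to set the number of sources in the batch

    // a single call dispatches the MRHS solver across all sources
    invertMultiSrcQuda(x.data(), b.data(), &inv_param);

    // true_res and true_res_hq are per-RHS arrays after this PR
    for (int i = 0; i < n_src; i++)
      printf("rhs %d: true_res = %e, true_res_hq = %e\n", i, inv_param.true_res[i], inv_param.true_res_hq[i]);
  }

On the multigrid side, batched null-space generation is enabled analogously by setting the new QudaMultigridParam::n_vec_batch parameter to the desired batch width.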

Things left to do

  • Add real MRHS support to all solvers (or at least to a few outstanding ones: CGNE, CGNR, BiCGStab, etc.)
  • Verify NVSHMEM MRHS operators are all working as expected
  • Update QUDA version number due to breakage of interface (QUDA 2.0?)
  • Fix reporting of true residual per RHS when running split grid

@maddyscientist (Member, Author) commented Oct 3, 2024

This PR is now functionally complete, and all tests are passing. This is ready for final review (@weinbe2 @hummingtree @mathiaswagner @bjoo).

@@ -371,10 +370,12 @@ namespace quda {

if (!param.is_preconditioner) { // do not do the below if we this is an inner solver
Review comment (Contributor):
In the spirit of typo fixing, if we this is -> if this is

@weinbe2 (Contributor) commented Oct 4, 2024

I have tested the batch CG solver with a modified version of MILC that properly utilizes the multisource MILC interface function. This is available here: https://github.com/lattice/milc_qcd/tree/feature/quda-block-solver-interface ; current commit is lattice/milc_qcd@f0404fe . This PR works perfectly fine with the current develop version of MILC.

I will note that this has only been tested with vanilla CG. I have not yet plumbed in multi-rhs support for the MG solver; I consider that within the scope of a second QUDA PR.

@weinbe2 (Contributor) commented Oct 4, 2024

When using coarsest-level deflation (perhaps just with staggered operators?) it looks like we need to change the default values corresponding to the flag --mg-eig-evals-batch-size 2 [###]. I uncovered this with coarsest-level deflation for staggered operators, coarse Nc = 64 or 96, on sm_80. I hit the following error when trying to converge 16 eigenvalues:

[...]
MG level 2 (GPU): RitzValue[0015]: (+2.6378149309623524e-03, +0.0000000000000000e+00) residual 1.4461873963747640e-05
MG level 2 (GPU): ERROR: nVec = 8 not instantiated
 (rank 0, host ipp1-1780.nvidia.com, block_transpose.cu:116 in void quda::launch_span_nVec(v_t&, quda::cvector_ref<O>&, quda::IntList<nVec, N ...>) [with v_t = quda::ColorSpinorField; b_t = const quda::ColorSpinorField; vFloat = float; bFloat = float; int nSpin = 2; int nColor = 64; int nVec = 16; int ...N = {}; quda::cvector_ref<O> = const quda::vector_ref<const quda::ColorSpinorField>]())
MG level 2 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=8192,aux=zero,color_spinor_field.cpp,406)

I compiled with MG MRHS support for 16 and 32; I can't think of anywhere that I've imposed a multiple of 8. The solution was explicitly setting --mg-eig-evals-batch-size 2 16 (where 2 is the lowest level).

My (relatively reduced) command is:

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.1 --recon 13 --recon-sloppy 9 --recon-precondition 9 \
  --dim 8 8 8 8 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 3 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 \
  --mg-setup-tol 1 1e-5 --mg-setup-inv 1 cgnr \
  --nsrc 1 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 ca-gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 --mg-coarse-solver-ca-basis-size 2 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose \
  --mg-eig 2 true --mg-eig-type 2 trlm --mg-eig-use-dagger 2 false --mg-eig-use-normop 2 true \
  --mg-nvec 2 16 --mg-eig-n-ev 2 16 --mg-eig-n-kr 2 128 --mg-eig-tol 2 1e-1 \
  --mg-eig-use-poly-acc 2 false --mg-eig-poly-deg 2 100 --mg-eig-amin 2 1e-1 \
  --mg-eig-max-restarts 2 1000

Neither toggling --mg-setup-use-mma 2 false nor --mg-dslash-use-mma 2 false works around this.

I can't quite think of a good way to address this (yet), but I'm also not clear on the details in the weeds. Maybe you know exactly where the fix is @maddyscientist ?

@maddyscientist (Member, Author) commented Oct 4, 2024

(Quoting @weinbe2's report above about the --mg-eig-evals-batch-size default and the "nVec = 8 not instantiated" error.)

Ok, I understand this issue. There are two things at play here:

  • For whatever reason --mg-dslash-use-mma i acts on the i + 1 level, so you should set --mg-dslash-use-mma 1 false, somewhat counter-intuitively. This was likely an oversight from when the MMA dslash was added. I can fix this.
  • If the evec batch size isn't set at the command line, it will use a default value of 8, which is what you've found. Perhaps 16 would be a better value for this, since that's the default MMA MRHS size in CMake?

Perhaps it would also be a good idea to fall back to the non-MMA dslash if the requested size isn't available? That would make things more bulletproof, perhaps with a warning on the first call?
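
For illustration only, a minimal sketch of the fallback-with-a-one-time-warning pattern being suggested here; this is not QUDA code, and mma_nvec_available, launch_mma_dslash, and launch_generic_dslash are hypothetical stand-ins.

  #include <cstdio>

  // Hypothetical stand-ins for the compiled-in MMA widths and the two dslash paths
  bool mma_nvec_available(int n_rhs) { return n_rhs == 16 || n_rhs == 32; }
  void launch_mma_dslash(int) { /* tensor-core path */ }
  void launch_generic_dslash(int) { /* non-MMA path */ }

  // Fall back to the generic dslash when the requested MRHS width has no MMA
  // instantiation, warning only on the first such call.
  void apply_coarse_dslash(int n_rhs)
  {
    if (mma_nvec_available(n_rhs)) {
      launch_mma_dslash(n_rhs);
      return;
    }
    static bool warned = false;
    if (!warned) {
      printf("Warning: no MMA instantiation for n_rhs = %d, falling back to generic dslash\n", n_rhs);
      warned = true;
    }
    launch_generic_dslash(n_rhs);
  }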

@weinbe2 (Contributor) left a comment
This has passed my visual review and my tests with MILC. Awesome PR @maddyscientist !

@bjoo (Member) left a comment
I had a very cursory look. I haven't had a chance to test the very latest with Chroma, but as of one or two commits back I think we fixed all the Chroma issues. So I am happy to approve. This is a great change.

@maddyscientist maddyscientist merged commit eead44e into develop Oct 10, 2024
14 checks passed
@maddyscientist maddyscientist deleted the feature/mrhs-solvers branch October 10, 2024 08:42
SaltyChiang added a commit to CLQCD/PyQUDA that referenced this pull request Oct 12, 2024