Add isContiguous check for new AmgX API #26

Merged: 1 commit into barbagroup:master on Jul 23, 2019

Conversation

@mhrywniak (Contributor) commented on Jul 22, 2019

The new AmgX API allows passing in partition offsets instead of a full partition vector.
Perform this check using the PETSc index set API to transparently enable the optimization (a sketch of what such a check might look like follows the list below).

  • Updated documentation (dependencies)
  • Added timing to poisson example to allow verification
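
For illustration only, here is a minimal sketch of how such a contiguity check could be wired up, assuming PETSc's `ISContiguousLocal()` and the AmgX `AMGX_distribution_*` API added with the new release; the function name `setPartitionData` and the variables `devIS`, `offsets`, and `partVec` are hypothetical, and conversions between `PetscInt` and AmgX index types are omitted. This is not the wrapper's actual code.

```cpp
// Sketch: choose between the cheap "offsets" path and the full
// partition-vector path depending on whether this rank's rows form a
// contiguous block of global indices.
#include <petscis.h>
#include <amgx_c.h>

PetscErrorCode setPartitionData(IS devIS, PetscInt nGlobalRows,
                                AMGX_distribution_handle dist,
                                const int *offsets,  // nRanks+1 row offsets
                                const int *partVec)  // one entry per global row
{
    PetscErrorCode ierr;
    PetscInt       start;
    PetscBool      contiguous;

    // Does this index set cover a contiguous range [start, start+n)
    // within the global rows [0, nGlobalRows)?
    ierr = ISContiguousLocal(devIS, 0, nGlobalRows, &start, &contiguous);
    CHKERRQ(ierr);

    if (contiguous)
        // Optimized path: only the per-rank offsets are handed to AmgX.
        AMGX_distribution_set_partition_data(
            dist, AMGX_DIST_PARTITION_OFFSETS, offsets);
    else
        // Fallback: pass the full partition vector, as before.
        AMGX_distribution_set_partition_data(
            dist, AMGX_DIST_PARTITION_VECTOR, partVec);

    return 0;
}
```

In the contiguous case only nRanks+1 integers describe the row layout instead of one entry per global row, which is the saving the setA timings below are meant to demonstrate.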

I verified that the optimization works on the poisson example by doing the following:

  1. Compile/load all libraries: AmgX latest (with the new API), PETSc latest (OpenMPI, GCC), CUDA 10.1
  2. Compile the poisson example using both AmgX and the PETSc build (called ompi_gcc_opt):
cd AmgXWrapper/examples/poisson && mkdir build && cd build
cmake -DPETSC_DIR=$TOOLS/petsc -DPETSC_ARCH=ompi_gcc_opt \
-DCUDA_DIR=$CUDA_HOME -DAMGX_DIR=$TOOLS/AMGX ..
make
  3. Run two cases on a single DGX-1V node with 4 and 8 MPI ranks, i.e. 4 and 8 GPUs:
N=400 time -p mpirun -np 4 -x N \
  bash -c "bin/poisson -caseName Test -mode AmgX_GPU -cfgFileName configs/AmgX_SolverOptions_AGG.info -Nx \$N -Ny \$N -Nz \$N -Nruns 1 -optFileName test"
  4. Get the results from the log file with grep -i solving test.log -A2 | awk '{printf("%-10s%s\n",$1, $4)}' (the command just extracts the runtime in seconds of the regions of interest).
  5. Run with the optimization turned off/on (toggled manually in the source); the times below show that the setA routine now scales along with the solve call. Scaling is not perfect because this matrix is still fairly small, sized so a quick benchmark could run on a single node for this PR; I have also tested this separately on multi-node GPU runs with larger matrices.
| Region  | np=4, optimization off | np=4, optimization on | np=8, optimization off | np=8, optimization on |
| ------- | ---------------------- | --------------------- | ---------------------- | --------------------- |
| Solving | 7.6343e+00 | 7.6004e+00 | 4.8465e+00 | 4.8461e+00 |
| WarmUp  | 7.6985e+00 | 7.6580e+00 | 4.8944e+00 | 4.8850e+00 |
| setA    | 5.8078e+00 | 4.1525e+00 | 4.4951e+00 | 2.5223e+00 |

I also verified that using N_ranks > N_gpus still works: it takes the optimized path (i.e. the indices are contiguous) and gets a slight speedup, though the overall runtime increases (presumably due to the consolidation overhead).

piyueh self-requested a review on Jul 22, 2019
piyueh assigned and then unassigned piyueh on Jul 22, 2019
@piyueh (Member) commented on Jul 22, 2019

Thanks @mhrywniak! I'm reviewing it now!

@piyueh (Member) commented on Jul 23, 2019

This PR looks good to me, though there is one minor issue worth mentioning: it breaks compatibility with older AmgX. The code introduced in this PR means the wrapper no longer works with any AmgX version before commit aba9132. This is fine with me, however, because I believe there are not many AmgXWrapper users, so I consider the issue very minor.
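
As an editorial aside (not part of the PR or of this review): if compatibility with pre-aba9132 AmgX were ever needed, one conceivable approach is a compile-time guard around the new calls. The sketch below assumes a hypothetical AMGX_HAS_DISTRIBUTION_API macro that a build system could define after probing amgx_c.h for AMGX_distribution_create(); no such guard exists in the wrapper.

```cpp
// Sketch: keep the wrapper buildable against AmgX versions that predate
// the AMGX_distribution_* API. AMGX_HAS_DISTRIBUTION_API is hypothetical.
#include <amgx_c.h>

void uploadPartitionInfo(AMGX_config_handle cfg, const int *offsets,
                         const int *partVec)
{
#ifdef AMGX_HAS_DISTRIBUTION_API
    // Newer AmgX: describe the row layout with a distribution handle
    // carrying only the per-rank offsets; the handle would then be passed
    // to AMGX_matrix_upload_distributed() (cleanup omitted).
    AMGX_distribution_handle dist;
    AMGX_distribution_create(&dist, cfg);
    AMGX_distribution_set_partition_data(dist,
        AMGX_DIST_PARTITION_OFFSETS, offsets);
    (void)partVec;
#else
    // Older AmgX: no distribution API; the full partition vector would be
    // handed to AMGX_matrix_upload_all_global() instead.
    (void)cfg; (void)offsets; (void)partVec;
#endif
}
```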

The PR will now be merged and closed.


Appendix: a quick performance test

I'm not able to do a serious test because all I currently have is a personal desktop (i7-5930K + ASUS X99-Deluxe) with two K40c GPUs, so I just did a quick test to get some numbers. I tried two smaller meshes (300x300x300 and 200x200x200) and both the aggregation and classical multigrid algorithms, with three combinations of AmgX and AmgXWrapper versions. The results below (wall time in seconds of the function setA) come from single runs, not averages over multiple runs, but they still give some sense of the trend.

  • AmgXWrapper v1.5 + AmgX 3049527

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 7.30 | 6.02 |
| 200x200x200; Agg | 2.02 | 1.84 |
| 200x200x200; Classical | 4.42 | 2.32 |

  • AmgXWrapper v1.5 + AmgX aba9132

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 7.95 | 5.19 |
| 200x200x200; Agg | 2.10 | 1.43 |
| 200x200x200; Classical | 4.43 | 2.25 |

  • AmgXWrapper PR26 + AmgX aba9132

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 6.91 | 4.56 |
| 200x200x200; Agg | 1.95 | 1.29 |
| 200x200x200; Classical | 4.44 | 2.23 |

piyueh merged commit 1a93865 into barbagroup:master on Jul 23, 2019