Add isContiguous check for new AmgX API #26

Merged: 1 commit into barbagroup:master on Jul 23, 2019

Conversation

@mhrywniak (Contributor) commented on Jul 22, 2019

The new AmgX API allows passing in partition offsets instead of a full partition vector.
Perform this check using the PETSc index set API to transparently enable the optimization (a sketch of what such a check might look like follows the list below).

  • Updated documentation (dependencies)
  • Added timing to poisson example to allow verification
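
For illustration only, here is a minimal sketch of how such a contiguity check could be wired up, assuming PETSc's `ISContiguousLocal()` and the AmgX `AMGX_distribution_*` API added with the new release; the function name `setPartitionData` and the variables `devIS`, `offsets`, and `partVec` are hypothetical, and conversions between `PetscInt` and AmgX index types are omitted. This is not the wrapper's actual code.

```cpp
// Sketch: choose between the cheap "offsets" path and the full
// partition-vector path depending on whether this rank's rows form a
// contiguous block of global indices.
#include <petscis.h>
#include <amgx_c.h>

PetscErrorCode setPartitionData(IS devIS, PetscInt nGlobalRows,
                                AMGX_distribution_handle dist,
                                const int *offsets,  // nRanks+1 row offsets
                                const int *partVec)  // one entry per global row
{
    PetscErrorCode ierr;
    PetscInt       start;
    PetscBool      contiguous;

    // Does this index set cover a contiguous range [start, start+n)
    // within the global rows [0, nGlobalRows)?
    ierr = ISContiguousLocal(devIS, 0, nGlobalRows, &start, &contiguous);
    CHKERRQ(ierr);

    if (contiguous)
        // Optimized path: only the per-rank offsets are handed to AmgX.
        AMGX_distribution_set_partition_data(
            dist, AMGX_DIST_PARTITION_OFFSETS, offsets);
    else
        // Fallback: pass the full partition vector, as before.
        AMGX_distribution_set_partition_data(
            dist, AMGX_DIST_PARTITION_VECTOR, partVec);

    return 0;
}
```

In the contiguous case only nRanks+1 integers describe the row layout instead of one entry per global row, which is the saving the setA timings below are meant to demonstrate.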

I verified that the optimization works on the poisson example by doing the following:

  1. Compile/load all libraries: AmgX latest (with the new API), PETSc latest (OpenMPI, GCC), CUDA 10.1
  2. Compile the poisson example using both AmgX and the PETSc build (called ompi_gcc_opt):
cd AmgXWrapper/examples/poisson && mkdir build && cd build
cmake -DPETSC_DIR=$TOOLS/petsc -DPETSC_ARCH=ompi_gcc_opt \
-DCUDA_DIR=$CUDA_HOME -DAMGX_DIR=$TOOLS/AMGX ..
make
  3. Run two cases on a single DGX-1V node with 4 and 8 MPI ranks, i.e. 4 and 8 GPUs:
N=400 time -p mpirun -np 4 -x N \
  bash -c "bin/poisson -caseName Test -mode AmgX_GPU -cfgFileName configs/AmgX_SolverOptions_AGG.info -Nx \$N -Ny \$N -Nz \$N -Nruns 1 -optFileName test"
  4. Get the results from the log file with grep -i solving test.log -A2 | awk '{printf("%-10s%s\n",$1, $4)}' (the command just extracts the runtime in seconds of the regions of interest).
  5. Run with the optimization turned off/on (toggled manually in the source); the times below show that the setA routine now scales along with the solve call. Scaling is not perfect because this matrix is still fairly small, sized so a quick benchmark could run on a single node for this PR; I have also tested this separately on multi-node GPU runs with larger matrices.
| Region  | np=4, optimization off | np=4, optimization on | np=8, optimization off | np=8, optimization on |
| ------- | ---------------------- | --------------------- | ---------------------- | --------------------- |
| Solving | 7.6343e+00 | 7.6004e+00 | 4.8465e+00 | 4.8461e+00 |
| WarmUp  | 7.6985e+00 | 7.6580e+00 | 4.8944e+00 | 4.8850e+00 |
| setA    | 5.8078e+00 | 4.1525e+00 | 4.4951e+00 | 2.5223e+00 |

I also verified that using N_ranks > N_gpus still works: it takes the optimized path (i.e. the indices are contiguous) and gets a slight speedup, though the overall runtime increases (presumably due to the consolidation overhead).

piyueh self-requested a review on Jul 22, 2019
piyueh assigned and then unassigned piyueh on Jul 22, 2019
@piyueh (Member) commented on Jul 22, 2019

Thanks @mhrywniak! I'm reviewing it now!

@piyueh (Member) commented on Jul 23, 2019

This PR looks good to me, though there is one minor issue worth mentioning: it breaks compatibility with older AmgX. The code introduced in this PR means the wrapper no longer works with any AmgX version before commit aba9132. This is fine with me, however, because I believe there are not many AmgXWrapper users, so I consider the issue very minor.
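
As an editorial aside (not part of the PR or of this review): if compatibility with pre-aba9132 AmgX were ever needed, one conceivable approach is a compile-time guard around the new calls. The sketch below assumes a hypothetical AMGX_HAS_DISTRIBUTION_API macro that a build system could define after probing amgx_c.h for AMGX_distribution_create(); no such guard exists in the wrapper.

```cpp
// Sketch: keep the wrapper buildable against AmgX versions that predate
// the AMGX_distribution_* API. AMGX_HAS_DISTRIBUTION_API is hypothetical.
#include <amgx_c.h>

void uploadPartitionInfo(AMGX_config_handle cfg, const int *offsets,
                         const int *partVec)
{
#ifdef AMGX_HAS_DISTRIBUTION_API
    // Newer AmgX: describe the row layout with a distribution handle
    // carrying only the per-rank offsets; the handle would then be passed
    // to AMGX_matrix_upload_distributed() (cleanup omitted).
    AMGX_distribution_handle dist;
    AMGX_distribution_create(&dist, cfg);
    AMGX_distribution_set_partition_data(dist,
        AMGX_DIST_PARTITION_OFFSETS, offsets);
    (void)partVec;
#else
    // Older AmgX: no distribution API; the full partition vector would be
    // handed to AMGX_matrix_upload_all_global() instead.
    (void)cfg; (void)offsets; (void)partVec;
#endif
}
```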

The PR will now be merged and closed.


Appendix: a quick performance test

I'm not able to do a serious test because all I currently have is a personal desktop (i7-5930K + ASUS X99-Deluxe) with two K40c GPUs, so I just did a quick test to get some numbers. I tried two smaller meshes (300x300x300 and 200x200x200) and both the aggregation and classical multigrid algorithms, with three combinations of AmgX and AmgXWrapper versions. The results below (wall time in seconds of the function setA) come from single runs, not averages over multiple runs, but they still give some sense of the trend.

  • AmgXWrapper v1.5 + AmgX 3049527

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 7.30 | 6.02 |
| 200x200x200; Agg | 2.02 | 1.84 |
| 200x200x200; Classical | 4.42 | 2.32 |

  • AmgXWrapper v1.5 + AmgX aba9132

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 7.95 | 5.19 |
| 200x200x200; Agg | 2.10 | 1.43 |
| 200x200x200; Classical | 4.43 | 2.25 |

  • AmgXWrapper PR26 + AmgX aba9132

| Case | 1 K40c | 2 K40c |
| --- | --- | --- |
| 300x300x300; Agg | 6.91 | 4.56 |
| 200x200x200; Agg | 1.95 | 1.29 |
| 200x200x200; Classical | 4.44 | 2.23 |

piyueh merged commit 1a93865 into barbagroup:master on Jul 23, 2019