forked from CICE-Consortium/CICE
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
ice_dyn_vp: use 'global_sum_prod' for bit-for-bit output reproducibility
The VP solver uses a linear solver, FGMRES, as part of the non-linear iteration. The FGMRES algorithm involves computing the norm of a distributed vector field, thus performing global sums. These norms are computed by first summing the squared X and Y components of a vector field in subroutine 'calc_L2norm_squared', summing these over the local blocks, and then doing a global (MPI) sum using 'global_sum'. This approach does not lead to reproducible results when the MPI distribution, or the number of local blocks, is changed, for reasons explained in the "Reproducible sums" section of the Developer Guide (mostly, floating point addition is not associative). This was partly pointed out in [1] but I failed to realize it at the time. Make the results of the VP solver more reproducible by using two calls to 'global_sum_prod' to individually compute the squares of the X and Y components when computing norms, and then summing these two reproducible scalars. The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the preconditioner 'pgmres' which uses the same algorithm, and the Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'. These changes result in twice the number of global sums for fgmres, pgmres and the MGS algorithm. For the CGS algorithm, the performance impact is higher as 'global_sum_prod' is called inside the loop, whereas previously we called 'global_allreduce_sum' after the loop to compute all 'initer' sums at the same time. To keep that optimization, we would have to implement a new interface 'global_allreduce_sum_prod' which would take two arrays of shape (nx_block,ny_block,max_blocks,k) and sum these over their first three dimensions before performing the global reduction over the k dimension. We choose to not go that route for now mostly because anyway the CGS algorithm is (by default) only used for the PGMRES preconditioner, and so the cost should be relatively low as 'initer' corresponds to 'dim_pgmres' in the namelist, which should be kept low for efficiency (default 5). These changes lead to bit-for-bit reproducibility (the decomp_suite passes) when using 'precond=ident' and 'precond=diag'. 'precond=pgmres' is still not bit-for-bit because some halo updates are skipped for efficiency. This will be addressed in a following commit. [1] CICE-Consortium#491 (comment)
- Loading branch information
1 parent
7705e13
commit 3c3bd60
Showing
1 changed file
with
25 additions
and
84 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters