This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

optimizations added w.r.t. threading #1151

Closed

Conversation

mywoodstock
Member

This PR is to merge in all the updates added to optimize the threading implementation (an illustrative sketch of the main patterns follows the list):

1. Reorganization of statements in the 'btr se subcycle loop' to avoid redundant computations, and loop fusion to merge initialization loops with the main loops.
  2. Removal of unnecessary threading barriers.
  3. Implementation of threading into the mpas reconstruct routine.
4. Reorganization of statements in 'diagnostic solve' to merge initializations with the main loops, removal of extra barriers, and vectorization and reordering of loops.
  5. Changing MPI threading level from multiple to funneled.
6. Reorganization of the buffer pack and unpack in halo exchanges to minimize the use of barriers.
  7. Implementation of threaded memory buffer initializations.
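
The following is a minimal sketch of the patterns in items 1, 2, 5, and 7; it is not MPAS source, and all names (nCells, nVertLevels, work, rhs) are placeholders:

program threading_sketch
   use mpi
   implicit none
   integer, parameter :: nCells = 1000, nVertLevels = 60
   real :: work(nVertLevels, nCells), rhs(nVertLevels, nCells)
   integer :: iCell, k, provided, ierr

   ! Item 5: request MPI_THREAD_FUNNELED instead of MPI_THREAD_MULTIPLE,
   ! i.e. only the master thread makes MPI calls.
   call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)

   rhs = 1.0

   !$omp parallel default(shared) private(iCell, k)
   !$omp do schedule(static)
   do iCell = 1, nCells
      do k = 1, nVertLevels
         work(k, iCell) = 0.0                           ! items 1, 7: initialization fused into the main loop
         work(k, iCell) = work(k, iCell) + rhs(k, iCell)
      end do
   end do
   !$omp end do nowait      ! item 2: drop the implicit barrier when the following work has no dependence
   !$omp end parallel

   call MPI_Finalize(ierr)
end program threading_sketch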

@toddringler
Contributor

@mywoodstock Thanks for making this PR. The plan was to make two PRs -- the first that is bit-for-bit and the second that is not. Is this PR bit-for-bit and, if not, can it be separated into two PRs? Thanks again for this effort. Cheers, Todd.

@mywoodstock
Member Author

mywoodstock commented Nov 28, 2016 via email

@mywoodstock
Member Author

mywoodstock commented Nov 28, 2016 via email

@mark-petersen
Contributor

@mywoodstock, thanks for separating and labeling the commits for b-f-b. The NOT-bfb ones come first, so you can make a new local branch from 45f4ab7, which is "NOT bit-reproducible". Push it to your fork and make a PR from it with "NOT bit-reproducible" in the title. Then this PR will still list all of the commits, but will contain only the bit-reproducible ones once the first PR is merged. Make a comment on both PRs that the other one (45f4ab7) should be merged first, and this one (bc1ddc2) second.
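
A possible command sequence for the split described above (the branch name and fork remote are placeholders, not prescribed names):

git checkout -b threading_not_bfb 45f4ab7     # branch at the last NOT bit-reproducible commit
git push <your-fork-remote> threading_not_bfb
# then open a PR from that branch with "NOT bit-reproducible" in the title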

@mark-petersen
Contributor

@mywoodstock it would be helpful if you could post some results on the performance improvements and on the differences in the solution for this PR. You can post a pdf or jpg, whatever you already have.

@mark-petersen
Contributor

@mywoodstock It would also be good to see the performance improvement of this PR for MPI-only runs, and your recommendation for the OpenMP versus MPI layout, at least on the machines and configurations that you tested.

@mark-petersen
Contributor

Running the standard test case: global_ocean/QU_240km/performance_test

cd /lustre/scratch3/turquoise/mpeterse/runs/t39d/ocean/global_ocean/QU_240km/performance_test/forward
wf293.localdomain> tail -n 18 log.0000.err

 *********************************************************************************************************
 INFO: The split explicit time integration is configured to use:           20  barotropic subcycles
 *********************************************************************************************************


Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2B0C99BCF2F7
#1  0x2B0C99BCF8FE
#2  0x2B0C9B05A65F
#3  0x10336EA in __ocn_diagnostics_MOD_ocn_diagnostic_solve
#4  0x1042C2B in __ocn_init_routines_MOD_ocn_init_routines_block
#5  0x1000A5B in __ocn_forward_mode_MOD_ocn_forward_mode_init
#6  0x1150C00 in __ocn_core_MOD_ocn_core_init
#7  0x40B3F7 in __mpas_subdriver_MOD_mpas_init
#8  0x4085B8 in MAIN__ at mpas.F:0

That was compiled with GNU and DEBUG=true. Compiling with Intel and debug will produce line numbers.

@mywoodstock
Member Author

mywoodstock commented Dec 7, 2016 via email

@mark-petersen
Contributor

@mywoodstock (cc @toddringler), I was able to run your branch on Edison, but was stopped by an error. It may be the same as the one posted above.

I am comparing these commits:
This is the last commit on your branch (call it 'yours')

* bc1ddc2 adding some of the missed updates
run: /scratch2/scratchdirs/mpeterse/runs/t39i/ocean/global_ocean/QU_240km/performance_test/forward

This is where you branched off from (call it 'mpas-o')

*   c3526e8 Merge branch 'ocean/salinity_limit' into ocean/develop
run: /scratch2/scratchdirs/mpeterse/runs/t39j/ocean/global_ocean/QU_240km/performance_test/forward

I can run a successful test case with mpas-o, and the same one dies with your branch with a NaN on the first time step. I am compiling with Intel with debug on. You can both look at my output and try to reproduce my results. The error is reported where the NaN is caught, but I believe the actual error in the code occurs earlier, where the NaN is first computed. I am running in our standard configuration, with no OpenMP.
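
(One way to catch the NaN at the point where it is first computed, rather than where it surfaces, is to trap floating-point exceptions in the debug build; the options below are the standard Intel and GNU compiler flags and may already be part of what DEBUG=true sets.)

# Intel Fortran: abort with a traceback on the first invalid operation, divide-by-zero, or overflow
-fpe0 -traceback -g
# GNU Fortran: equivalent trapping
-ffpe-trap=invalid,zero,overflow -fbacktrace -g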

To run this test, copy this directory to your own space:
/scratch2/scratchdirs/mpeterse/runs/t39i
Copy it twice, once for 'mpas-o' and once for your branch, for comparison. Compile using:

module load cray-netcdf/4.3.3.1 cray-parallel-netcdf/1.6.1
module load metis/5.1.0

setenv NETCDF ${NETCDF_DIR}
setenv PNETCDF ${PARALLEL_NETCDF_DIR}
setenv PIO /global/homes/a/asarje/edison/pio1_6_3/pio

make intel-nersc CORE=ocean DEBUG=true

Then link the executable ocean_model here (but on your file space)

edison05> cd /scratch2/scratchdirs/mpeterse/runs/t39i/ocean/global_ocean/QU_240km/performance_test/forward
Directory: /scratch2/scratchdirs/mpeterse/runs/t39i/ocean/global_ocean/QU_240km/performance_test/forward
edison05> ls -lh ocean_model
lrwxrwxrwx 1 mpeterse mpeterse 58 Oct 19 05:47 ocean_model -> /global/u2/m/mpeterse/repos/MPAS/ocean_develop/ocean_model
edison05> cd /scratch2/scratchdirs/mpeterse/runs/t39j/ocean/global_ocean/QU_240km/performance_test/forward
Directory: /scratch2/scratchdirs/mpeterse/runs/t39j/ocean/global_ocean/QU_240km/performance_test/forward
edison05> ls -lh ocean_model
lrwxrwxrwx 1 mpeterse mpeterse 69 Dec  8 06:25 ocean_model -> /global/u2/m/mpeterse/repos/MPAS/develop-optimized_abinov/ocean_model

I ran it as follows. Interactive login:
salloc -N 1 -p debug -L SCRATCH
cd to run directory, load same modules as above, and
srun -n 16 ocean_model

You can see the error messages in the log*err files.

@mark-petersen
Contributor

@mywoodstock Looking at the files that have been changed, I see 5 are in src/core_ocean and 4 are in src/framework or src/operators. Those are two different categories of pull requests. The first is a PR into MPAS-Dev:ocean/develop, and the second is a pull request into MPAS-Dev:develop. Sorry I didn't look at this earlier.

The immediate question is: if we committed those two types of files separately, would there be errors? That is, if we merged just the src/core_ocean changes in this commit, and just the src/framework and src/operators changes in another commit, would each work independently?

Once the latter is merged, does it potentially require the other cores (atmosphere, sea-ice, land-ice) to make any changes in their code in src/core_* that calls the altered framework routines?

@mywoodstock
Member Author

@mark-petersen Let me make sure that the src/core_ocean changes can be merged independently of the src/framework and src/operators changes.
Once everything is merged in, the other cores need not make any changes if they are not using OpenMP. If they are using OpenMP, and they call the changed src/operators routines from inside an omp parallel region without any nesting, it will also be fine. If their OpenMP threading uses nesting, they will need to double-check that these changes do not disrupt it (in typical threading scenarios, it should work as is).
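
A minimal sketch of the two situations described above; "threaded_operator" stands in for any of the changed src/operators routines, and is not an actual MPAS routine:

module operator_sketch
   implicit none
contains
   subroutine threaded_operator(field, n)
      ! Stand-in for a changed src/operators routine: the worksharing
      ! directive is orphaned and binds to whichever team calls it.
      integer, intent(in) :: n
      real, intent(inout) :: field(n)
      integer :: i
      !$omp do
      do i = 1, n
         field(i) = 2.0 * field(i)
      end do
      !$omp end do
   end subroutine threaded_operator
end module operator_sketch

program nesting_sketch
   use operator_sketch
   implicit none
   real :: field(100)
   field = 1.0

   ! Case 1: call from inside an existing, non-nested parallel region.
   ! The !$omp do inside the routine binds to this team, so it works as is.
   !$omp parallel
   call threaded_operator(field, size(field))
   !$omp end parallel

   ! Case 2: nested parallel regions. The routine's directives bind to the
   ! inner team, so each outer thread's inner team would traverse the full
   ! loop; a core that uses nesting needs to double-check this path.
   !$omp parallel
   !$omp parallel
   call threaded_operator(field, size(field))
   !$omp end parallel
   !$omp end parallel
end program nesting_sketch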

@mywoodstock
Member Author

@mark-petersen On another note, I forgot to mention this earlier, but when I compile the code on Cori I always get the following error; it then compiles fine if I comment out the last argument in that routine call. Here is the error:

mpas_ocn_vmix_cvmix.F(920): error #6627: This is an actual argument keyword name, and not a dummy argument name. [LENHANCED_DIFF]
lenhanced_diff = config_cvmix_kpp_use_enhanced_diff)
---------------^
compilation aborted for mpas_ocn_vmix_cvmix.F (code 1)
Makefile:195: recipe for target 'mpas_ocn_vmix_cvmix.o' failed
make[4]: *** [mpas_ocn_vmix_cvmix.o] Error 1
make[4]: Leaving directory '/global/u1/a/asarje/cori/mpas/MPAS.git/src/core_ocean/shared'
Makefile:71: recipe for target 'shared' failed
make[3]: *** [shared] Error 2
make[3]: Leaving directory '/global/u1/a/asarje/cori/mpas/MPAS.git/src/core_ocean'
Makefile:39: recipe for target 'dycore' failed
make[2]: *** [dycore] Error 2
make[2]: Leaving directory '/global/u1/a/asarje/cori/mpas/MPAS.git/src'
Makefile:617: recipe for target 'mpas_main' failed
make[1]: *** [mpas_main] Error 2
make[1]: Leaving directory '/global/u1/a/asarje/cori/mpas/MPAS.git'
Makefile:300: recipe for target 'intel-nersc' failed
make: *** [intel-nersc] Error 2

@vanroekel
Contributor

vanroekel commented Dec 21, 2016

@mywoodstock I just tried to compile a fresh checkout of ocean/develop on Cori and did not see the issue you mentioned here. My guess is that your version of cvmix is not up to date. We recently changed the git tag of the cvmix library we pull for the model, and the new tag introduced the lenhanced_diff flag. From within MPAS, if you go to src/core_ocean/cvmix and do git describe --tags,

v0.84-beta

is the current tag of cvmix for MPAS. That said, for your testing, commenting out the argument as you have done is fine.
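
If the checkout really is behind, updating it to that tag should also resolve the build error; a possible sequence, assuming the cvmix directory is the git checkout mentioned above:

cd src/core_ocean/cvmix
git fetch --tags
git checkout v0.84-beta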

@mark-petersen
Contributor

@mywoodstock same question here as on #1164. What if I split the MPAS framework changes (files in src/framework and src/operators) into a different pull request? Could I merge the remaining files in this PR? That is, the changes in the different files are not dependent on each other, right?

The ocean changes could be tested and merged in right away by me. The framework changes require approval by others, so could be on a much longer timescale.

Do you foresee any problems if ACME ran with only the changes in the ocean files for a while, and without the framework changes?

@mark-petersen
Contributor

This PR was rebased and separated into #1235, #1236, and #1237

mark-petersen added a commit that referenced this pull request May 9, 2017
This PR replaces #1151, and includes only bfb changes to framework.
This includes:

1. Implementation of threading into the mpas reconstruct routine.
2. Changing MPI threading level from multiple to funneled.
3. Reorganization in buffer pack and unpack in halo exchanges to
   minimize use of barriers.
4. Implementation of threaded memory buffer initializations.