
MOM_hor_visc: horizontal_viscosity loop reorder #1287

Merged 3 commits into dev/gfdl on Jan 19, 2021

Conversation

marshallward
Collaborator

This patch reorders many of the loops in horizontal_viscosity in order
to improve vectorization of the Laplacian and biharmonic viscosities.

Specifically, a single loop containing many different computations was
broken up into many loops of individual operations. This patch required
the introduction of several new 2D arrays.

On Gaea's Broadwell CPUs (E5-2697 v4), this is a ~80% speedup on a
32x32x75 benchmark configuration. Smaller speedups were observed on
AMD processors.

On the Gaea nodes, performance appears to be limited by the very large
number of variables in the function stack, and the high degree of stack
spill. Further loop reordering may cause slowdowns unless the stack
usage is reduced.

No answers should be changed by this patch, but it deserves extra scrutiny
given the fundamental role of this function in nearly all simulations.
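As a minimal sketch of the kind of transformation involved (the array and bound names here are invented for illustration, not the actual MOM_hor_visc code):

```fortran
! Before: one loop body computes several unrelated quantities,
! which inhibits vectorization.
do j = js, je ; do i = is, ie
  dudx(i,j)  = (u(i+1,j) - u(i,j)) * idx(i,j)
  dvdy(i,j)  = (v(i,j+1) - v(i,j)) * idy(i,j)
  sh_xx(i,j) = dudx(i,j) - dvdy(i,j)
enddo ; enddo

! After: each quantity is computed in its own tight loop over a
! 2D temporary, which vectorizes more readily, at the cost of
! additional arrays (and hence stack usage).
do j = js, je ; do i = is, ie
  dudx(i,j) = (u(i+1,j) - u(i,j)) * idx(i,j)
enddo ; enddo
do j = js, je ; do i = is, ie
  dvdy(i,j) = (v(i,j+1) - v(i,j)) * idy(i,j)
enddo ; enddo
do j = js, je ; do i = is, ie
  sh_xx(i,j) = dudx(i,j) - dvdy(i,j)
enddo ; enddo
```

The stack-spill caveat in the description is the flip side of this transformation: each fission step adds another 2D temporary to the function's working set.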

marshallward and others added 2 commits January 12, 2021 14:14
@codecov-io

Codecov Report

Merging #1287 (5c93def) into dev/gfdl (0bd16f4) will increase coverage by 0.07%.
The diff coverage is 70.11%.


@@             Coverage Diff              @@
##           dev/gfdl    #1287      +/-   ##
============================================
+ Coverage     45.82%   45.90%   +0.07%     
============================================
  Files           227      225       -2     
  Lines         71552    71507      -45     
============================================
+ Hits          32791    32825      +34     
+ Misses        38761    38682      -79     
Impacted Files Coverage Δ
src/parameterizations/lateral/MOM_hor_visc.F90 65.70% <70.11%> (-0.94%) ⬇️
src/framework/MOM_io.F90 42.95% <0.00%> (-10.62%) ⬇️
src/framework/MOM_restart.F90 37.31% <0.00%> (-0.06%) ⬇️
src/core/MOM_open_boundary.F90 30.76% <0.00%> (ø)
src/core/MOM_dynamics_unsplit.F90 91.76% <0.00%> (ø)
src/diagnostics/MOM_sum_output.F90 63.88% <0.00%> (ø)
src/tracer/MOM_tracer_diabatic.F90 38.09% <0.00%> (ø)
src/core/MOM_dynamics_split_RK2.F90 88.94% <0.00%> (ø)
src/core/MOM_dynamics_unsplit_RK2.F90 92.71% <0.00%> (ø)
src/framework/MOM_io_wrapper.F90
... and 6 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0bd16f4...7114d63.

@@ -519,22 +528,25 @@ subroutine horizontal_viscosity(u, v, h, diffu, diffv, MEKE, VarMix, G, GV, US,
! shearing strain advocated by Smagorinsky (1993) and discussed in
! Griffies and Hallberg (2000).

! Calculate horizontal tension
-  do j=Jsq-1,Jeq+2 ; do i=Isq-1,Ieq+2
+  do j=Jsq-2,Jeq+2 ; do i=Isq-2,Ieq+2
Collaborator

By fusing the two loops here (the do-loop blocks that were previously at lines 523 and 533), there is a mix of locations, and the stencil size for the values of u and v that are used is expanded by one point in the i- and j-directions, respectively. It also breaks the case-sensitive indexing convention that we once considered using to automatically flip between a SW and a NE indexing convention. Elsewhere in this commit, performance is improved by breaking up large loops, and the only variables shared between these two loops are u and v, not any of the metric arrays. Are the performance gains from fusing these two specific loops large enough to justify these potential costs? Would you consider undoing this particular loop fusion?
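Schematically, the concern looks like this (bounds and array names are illustrative only, not the exact MOM6 extents):

```fortran
! Separate loops: the tension (h-point) and shearing strain
! (q-point, capital I/J by convention) each run over exactly the
! range they need.
do j = Jsq-1, Jeq+2 ; do i = Isq-1, Ieq+2
  sh_xx(i,j) = dudx(i,j) - dvdy(i,j)
enddo ; enddo
do J = Jsq-2, Jeq+1 ; do I = Isq-2, Ieq+1
  sh_xy(I,J) = dvdx(I,J) + dudy(I,J)
enddo ; enddo

! Fused: one loop over the union of the two ranges, so the tension
! term is evaluated over a wider stencil than it strictly needs, and
! the lowercase/capital index distinction can no longer be expressed.
do j = Jsq-2, Jeq+2 ; do i = Isq-2, Ieq+2
  sh_xx(i,j) = dudx(i,j) - dvdy(i,j)
  sh_xy(i,j) = dvdx(i,j) + dudy(i,j)
enddo ; enddo
```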

Collaborator Author

Yes, I was a bit worried about this one for the reasons that you mention (the potentially invalid range and the loss of the IJ/ij syntax).

I fused these because there was a measurable speedup and decided that was justification enough, but in the end it was one small improvement among many and could be sacrificed.

But we should probably discuss it a bit further, given that it is faster to fuse them.

Collaborator Author

On my home machine (AMD Ryzen 5 2600, 2133 MT/s RAM), my benchmark case is about 1-2% slower if I revert this change (0.998 s and 1.012 s, vs 1.026 s and 1.020 s).

Given that this is actually measurable, perhaps we (aka I) should try to work out some solution which preserves both the speedup and the loop separation.

Collaborator

Maybe the right answer here is to split the loops so we can accept this PR, despite the performance hit, and then look at options for recovering this speedup in a second PR. Alternatively, given that this PR is holding up a number of other PRs but does not seem to interact with them, you could green-light deferring action on it so we can move on, sort out the interactions between the next several commits, and get ready for the pre-FMS2 PR to main.

Collaborator Author

I'll revert the loop but add a comment explaining the potential speedup.

I am almost certain this is going to come up more and more as we dig into these loops, so at the least I want to remind myself of the issue.

   if (CS%add_LES_viscosity) then
-    if (CS%Smagorinsky_Kh) Kh = Kh + CS%Laplac2_const_xx(i,j) * Shear_mag
-    if (CS%Leith_Kh) Kh = Kh + CS%Laplac3_const_xx(i,j) * vert_vort_mag*inv_PI3
+    if (CS%Smagorinsky_Kh) &
Collaborator

Would the code in the 13-line block at the new lines 888-900 be a good candidate for restructuring to put the logical tests outside of the do-loops, as was found to be more efficient elsewhere in this PR?
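The suggested restructuring, sketched with the flag names from the excerpt above but invented loop bounds and 2D arrays (the actual code uses scalar temporaries and additional terms):

```fortran
! Before: constant logical flags are tested at every (i,j) point.
do j = js, je ; do i = is, ie
  if (CS%Smagorinsky_Kh) Kh(i,j) = Kh(i,j) + Laplac2(i,j) * Shear_mag(i,j)
  if (CS%Leith_Kh) Kh(i,j) = Kh(i,j) + Laplac3(i,j) * vort_mag(i,j)
enddo ; enddo

! After: each flag is tested once, outside the loops, leaving
! branch-free loop bodies that the compiler can vectorize.
if (CS%Smagorinsky_Kh) then
  do j = js, je ; do i = is, ie
    Kh(i,j) = Kh(i,j) + Laplac2(i,j) * Shear_mag(i,j)
  enddo ; enddo
endif
if (CS%Leith_Kh) then
  do j = js, je ; do i = is, ie
    Kh(i,j) = Kh(i,j) + Laplac3(i,j) * vort_mag(i,j)
  enddo ; enddo
endif
```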

Collaborator Author

@marshallward marshallward Jan 14, 2021

Yes, this is one example among several, but unfortunately I started to experience some of the "stack spill" issues that we discussed privately, where the function begins to slow down rather than speed up.

At this point, I am too worried to make further changes until we figure out how to reduce the function stack (and perhaps confirm a little more thoroughly that this is what is actually going on, and that reducing the function stack would help here).

Collaborator Author

This comment block appears later in a similar section. Perhaps another disclaimer is needed here...?

! NOTE: The following do-block can be decomposed and vectorized, but
!   appears to cause slowdown on some machines.  Evidence suggests that
!   this is caused by excessive spilling of stack variables.
! TODO: Vectorize these loops after stack usage has been reduced.

The fusion of the tension and shear strains yields a 1-2% speedup, but it
also breaks the style convention of capitalized vertex-point indices and
evaluates the tension terms over a slightly larger domain, so it has been
reverted.

A note has been added to investigate this later.
Collaborator

@Hallberg-NOAA Hallberg-NOAA left a comment

With the changes in this latest commit, all of my concerns have been addressed, and I think this PR is acceptable (and a very nice contribution).

@Hallberg-NOAA
Collaborator

This PR passed the pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/11881 .

@Hallberg-NOAA Hallberg-NOAA merged commit be5fb70 into mom-ocean:dev/gfdl Jan 19, 2021
@marshallward marshallward deleted the hor_visc_2x branch May 7, 2021 02:51