MOM_hor_visc: horizontal_viscosity loop reorder #1287
Conversation
This patch reorders many of the loops in horizontal_viscosity in order to improve vectorization of the Laplacian and biharmonic viscosities. Specifically, a single loop containing many different computations was broken up into many loops of individual operations. This required the introduction of several new 2D arrays. On Gaea's Broadwell CPUs (E5-2697 v4), this gives a ~80% speedup on a 32x32x75 `benchmark` configuration. Smaller speedups were observed on AMD processors. On the Gaea nodes, performance appears to be limited by the very large number of variables in the function stack and the high degree of stack spill. Further loop reordering may cause slowdowns unless the stack usage is reduced. No answers should be changed by this patch, but it deserves extra scrutiny given the fundamental role of this function in nearly all simulations.
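As a rough illustration of the kind of restructuring described above, the sketch below splits one multi-purpose loop into per-array loops. All names (`split_loops_sketch`, `sh_xx`, `div_xx`) and the right-hand-side expressions are placeholders, not the actual MOM_hor_visc code.

```fortran
! Sketch only: sh_xx and div_xx stand in for the new 2D work arrays, and the
! right-hand sides are simplified stand-ins for the real stencil operations.
subroutine split_loops_sketch(is, ie, js, je, u, v, dx, dy, sh_xx, div_xx)
  implicit none
  integer, intent(in) :: is, ie, js, je
  real, dimension(is:ie+1,js:je), intent(in)  :: u        ! u-point field
  real, dimension(is:ie,js:je+1), intent(in)  :: v        ! v-point field
  real, dimension(is:ie,js:je),   intent(in)  :: dx, dy   ! grid metrics
  real, dimension(is:ie,js:je),   intent(out) :: sh_xx, div_xx
  integer :: i, j

  ! Before: one loop body carrying several unrelated computations, which
  ! limits vectorization and raises register pressure.
  !   do j=js,je ; do i=is,ie
  !     sh_xx(i,j)  = dy(i,j)*(u(i+1,j)-u(i,j)) - dx(i,j)*(v(i,j+1)-v(i,j))
  !     div_xx(i,j) = dy(i,j)*(u(i+1,j)-u(i,j)) + dx(i,j)*(v(i,j+1)-v(i,j))
  !   enddo ; enddo

  ! After: one tight loop per output array, each of which vectorizes cleanly.
  do j=js,je ; do i=is,ie
    sh_xx(i,j) = dy(i,j)*(u(i+1,j)-u(i,j)) - dx(i,j)*(v(i,j+1)-v(i,j))
  enddo ; enddo
  do j=js,je ; do i=is,ie
    div_xx(i,j) = dy(i,j)*(u(i+1,j)-u(i,j)) + dx(i,j)*(v(i,j+1)-v(i,j))
  enddo ; enddo
end subroutine split_loops_sketch
```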
Codecov Report
@@            Coverage Diff             @@
##           dev/gfdl    #1287      +/-   ##
============================================
+ Coverage     45.82%   45.90%   +0.07%
============================================
  Files           227      225       -2
  Lines         71552    71507      -45
============================================
+ Hits          32791    32825      +34
+ Misses        38761    38682      -79
Continue to review full report at Codecov.
@@ -519,22 +528,25 @@ subroutine horizontal_viscosity(u, v, h, diffu, diffv, MEKE, VarMix, G, GV, US,
      ! shearing strain advocated by Smagorinsky (1993) and discussed in
      ! Griffies and Hallberg (2000).

      ! Calculate horizontal tension
-     do j=Jsq-1,Jeq+2 ; do i=Isq-1,Ieq+2
+     do j=Jsq-2,Jeq+2 ; do i=Isq-2,Ieq+2
By fusing the two loops here (the do-loop blocks that were previously at lines 523 and 533), there is now a mix of grid locations within one loop, and the stencil of u and v values that are used is expanded by one point in the i- and j-directions respectively. Also, the case-sensitive indexing convention that we once considered using to automatically flip between a SW and a NE indexing convention is broken. Elsewhere in this commit, performance is improved by breaking up large loops, and the only variables shared between these two loops are u and v, not any of the metric arrays. Are the performance gains from fusing these two specific loops large enough to justify these potential costs? Would you consider undoing this particular loop fusion?
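A schematic of the two layouts under discussion, for reference; everything below is a placeholder sketch (simplified right-hand sides, array shapes, and the name `fusion_sketch`), not the MOM_hor_visc source.

```fortran
! Sketch only: the assignments are stand-ins for the tension (h-point) and
! shearing-strain (vertex/q-point) calculations.
subroutine fusion_sketch(Isq, Ieq, Jsq, Jeq, u, v, sh_xx, sh_xy)
  implicit none
  integer, intent(in) :: Isq, Ieq, Jsq, Jeq
  real, dimension(Isq-2:Ieq+2,Jsq-2:Jeq+2), intent(in)  :: u, v
  real, dimension(Isq-2:Ieq+2,Jsq-2:Jeq+2), intent(out) :: sh_xx  ! h-points
  real, dimension(Isq-2:Ieq+2,Jsq-2:Jeq+2), intent(out) :: sh_xy  ! q-points
  integer :: i, j

  ! Separate loops: each quantity keeps its own extent, and the style
  ! convention of lower-case i,j at h-points vs. upper-case I,J at vertex
  ! points can be kept (Fortran itself is case-insensitive, so this is
  ! purely a readability convention).
  do j=Jsq-1,Jeq+2 ; do i=Isq-1,Ieq+2
    sh_xx(i,j) = u(i,j) - v(i,j)      ! stand-in for the horizontal tension
  enddo ; enddo
  do J=Jsq-2,Jeq+1 ; do I=Isq-2,Ieq+1
    sh_xy(I,J) = u(I,J) + v(I,J)      ! stand-in for the shearing strain
  enddo ; enddo

  ! Fused loop: a single extent must cover both quantities, so the tension
  ! is evaluated over one extra row and column of points and the i/I, j/J
  ! distinction is lost.
  do j=Jsq-2,Jeq+2 ; do i=Isq-2,Ieq+2
    sh_xx(i,j) = u(i,j) - v(i,j)
    sh_xy(i,j) = u(i,j) + v(i,j)
  enddo ; enddo
end subroutine fusion_sketch
```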
Yes, I was a bit worried about this one for the reasons that you mention (potentially invalid range and loss of the IJ/ij syntax).
I fused these because there was a measurable speedup, and decided that was justification enough, but in the end it was one small improvement among many and could be sacrificed.
But we should probably discuss it a bit further, given that it is faster to fuse them.
On my home machine (AMD Ryzen 5 2600, 2133 MT/s RAM), my benchmark case is about 1-2% slower if I revert this change (0.998 s, 1.012 s vs. 1.026 s, 1.020 s).
Given that this is actually measurable, perhaps we (i.e. I) should try to work out a solution that preserves both the speedup and the loop separation.
Maybe the right answer here is to split the loops so we can accept this PR, despite the performance hit, and then look at options for recovering this speedup with a second PR. Alternatively, given that this is holding up a number of other PRs but does not seem to interact with them, you could green-light deferring action on this PR so we can move on, sort out the interactions between the next several commits, and get ready for the pre-FMS2 PR to main.
I'll revert the loop but add a comment explaining the potential speedup.
I am almost certain this is going to come up more and more as we dig into these loops, so at the least I want to remind myself of the issue.
if (CS%add_LES_viscosity) then
  if (CS%Smagorinsky_Kh) Kh = Kh + CS%Laplac2_const_xx(i,j) * Shear_mag
  if (CS%Leith_Kh) Kh = Kh + CS%Laplac3_const_xx(i,j) * vert_vort_mag*inv_PI3
  if (CS%Smagorinsky_Kh) &
Would the code in the 13-line block from the new lines 888 - 900 be a good candidate for restructuring to put the logical tests outside of the do-loops, as was found to be more efficient elsewhere in this PR?
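For reference, a hedged sketch of the restructuring being suggested; the names below (`hoist_flag_sketch`, `use_smag`, `smag_const`, `shear_mag`) are hypothetical stand-ins, not the actual CS% fields or loop bounds.

```fortran
! Sketch only: hoist a configuration flag out of the inner loops.
subroutine hoist_flag_sketch(is, ie, js, je, use_smag, smag_const, shear_mag, Kh)
  implicit none
  integer, intent(in) :: is, ie, js, je
  logical, intent(in) :: use_smag
  real, dimension(is:ie,js:je), intent(in)    :: smag_const, shear_mag
  real, dimension(is:ie,js:je), intent(inout) :: Kh
  integer :: i, j

  ! Before: the configuration flag is re-tested on every (i,j) iteration.
  !   do j=js,je ; do i=is,ie
  !     if (use_smag) Kh(i,j) = Kh(i,j) + smag_const(i,j)*shear_mag(i,j)
  !   enddo ; enddo

  ! After: test the flag once, then run a branch-free inner loop.
  if (use_smag) then
    do j=js,je ; do i=is,ie
      Kh(i,j) = Kh(i,j) + smag_const(i,j)*shear_mag(i,j)
    enddo ; enddo
  endif
end subroutine hoist_flag_sketch
```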
Yes, this is one example among several, but unfortunately I started to experience some of the "stack spill" issues that we discussed privately, where the function begins to slow down rather than speed up.
At this point, I am too worried to make further changes until we figure out how to reduce the function stack (and perhaps until we confirm a little more thoroughly that this is what is actually going on, and that reducing the function stack would help here).
This comment block appears later in a similar section. Perhaps another disclaimer is needed here...?
! NOTE: The following do-block can be decomposed and vectorized, but
! appears to cause slowdown on some machines. Evidence suggests that
! this is caused by excessive spilling of stack variables.
! TODO: Vectorize these loops after stack usage has been reduced.
The fusion of the tension and shearing-strain loops yields a 1-2% speedup, but it breaks the style convention of capitalized indices at vertex points and evaluates the tension terms over a slightly larger domain, so it has been reverted. A note has been added to investigate this later.
With the changes in this latest commit, all of my concerns have been addressed, and I think this PR is acceptable (and a very nice contribution).
This PR passed the pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/11881.