Enable GPU execution of atm_advance_acoustic_step via OpenACC #1251
NOTE: This PR is paused. I am sorting out the merge conflicts and a run-time error. I will notify again when this PR is ready for review.
Force-pushed from 58ba84a to 0730fa2.
Force-pushed from 0730fa2 to 09e60a5.
Force-push 0730fa2 to 09e60a5 to consistently add new invariant fields at the end of sections in @mgduda and @abishekg7 this should be ready for review!
!MGD this loop will not be very load balanced with if-test below

!$acc parallel
Should we add `default(present)` here?
Yes, addressed now by fixup d7109c1
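For readers unfamiliar with the clause under discussion, here is a minimal sketch of how `default(present)` behaves. The arrays `a` and `b` and the program name are hypothetical and are not MPAS fields; the point is only that the compute region relies on an enclosing data region instead of implicit copies. Without an OpenACC compiler the directives are treated as comments and the program runs on the CPU.

```fortran
program default_present_sketch
   implicit none
   integer, parameter :: n = 1000      ! hypothetical problem size
   real, allocatable :: a(:), b(:)     ! hypothetical arrays, not MPAS fields
   integer :: i

   allocate(a(n), b(n))
   a(:) = 1.0
   b(:) = 0.0

   ! Explicit data region: a is copied to the device on entry,
   ! b is copied back to the host on exit.
   !$acc data copyin(a) copyout(b)

   ! default(present) asserts that every array referenced in the region
   ! is already on the device; a missing array is reported at run time
   ! instead of being copied implicitly.
   !$acc parallel default(present)
   !$acc loop gang vector
   do i = 1, n
      b(i) = 2.0 * a(i)
   end do
   !$acc end parallel

   !$acc end data

   ! Self-check: every element should have been doubled.
   if (any(b /= 2.0)) error stop 'unexpected result'
   print *, 'b(1) =', b(1)
end program default_present_sketch
```

If `a` were dropped from the `data` directive, an OpenACC build would be expected to fail at run time rather than silently transfer the array, which is the safety net the review comment asks for.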
end if

!$OMP BARRIER

!$acc parallel
!$acc loop gang private(ts,rs)
Would it help to specify `gang worker` here instead of only `gang`? I tried it out and it improves performance marginally, but I'm also wondering if there's a reason we want to keep this as `gang`.
I was wondering the same.
I think `worker` wasn't specified in case I needed that level in this big loop. I can add it easily.
Well, "easily" turned out to be harder than I thought, but this is now addressed by fixup 2030ecc.
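As an illustration of the `gang worker` suggestion, the sketch below distributes an outer cell loop across gangs and workers and leaves the inner level loops to the vector lanes. The array names, sizes, and arithmetic are assumptions made for the example; only the directive pattern (`loop gang worker private(...)` with inner `loop vector`) reflects the thread.

```fortran
program gang_worker_sketch
   implicit none
   integer, parameter :: nCells = 64, nLevels = 32   ! hypothetical sizes
   real :: field(nLevels, nCells)                    ! hypothetical field
   real :: ts(nLevels)                               ! scratch column, one private copy per worker
   integer :: iCell, k

   field(:,:) = 1.0

   !$acc parallel copy(field)
   ! Outer loop over cells is shared among gangs and workers;
   ! each worker gets its own private copy of the scratch array ts.
   !$acc loop gang worker private(ts)
   do iCell = 1, nCells
      ! Inner loops over levels use the remaining vector parallelism.
      !$acc loop vector
      do k = 1, nLevels
         ts(k) = real(k)
      end do
      !$acc loop vector
      do k = 1, nLevels
         field(k,iCell) = field(k,iCell) + ts(k)
      end do
   end do
   !$acc end parallel

   ! Self-check: the last entry should be 1 + nLevels.
   if (field(nLevels,nCells) /= real(nLevels + 1)) error stop 'unexpected result'
   print *, 'field(nLevels,nCells) =', field(nLevels,nCells)
end program gang_worker_sketch
```

Using `gang` alone on the outer loop would leave the worker level free for a nested loop, which matches the reason given above for originally omitting `worker`.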
@mgduda and @abishekg7 this is ready for review now if you want. I plan to squash this into one commit like #1237 later today if you'd rather review that.
Force-pushed from 2030ecc to 6ec497f.
@mgduda and @abishekg7, force-push 2030ecc to 6ec497f squashed this to one commit. Let me know what you think! EDIT: caught a typo of mine, this second force-push fixed it.
Force-pushed from 6ec497f to 68253c3.
<<<<<<< HEAD
!$acc loop gang worker vector collapse(2)
=======
!$acc loop collapse(2)
>>>>>>> d7109c12a (fixup! Add acc data movement to atm_advance_acoustic_step_work)
There's a merge conflict here, FYI.
Thanks for spotting it. I'll get that sorted.
<<<<<<< HEAD
!$acc loop gang worker private(ts,rs)
=======
!$acc loop gang private(ts,rs)
Also here.
Force-pushed from 68253c3 to 6e40864.
Tested with both limited-area and J-W baroclinic cases, and got bit-identical results with the develop branch. Looks good.
do k=1,nVertLevels
   pgrad = ((rtheta_pp(k,cell2)-rtheta_pp(k,cell1))*invDcEdge(iEdge) )/(.5*(zz(k,cell2)+zz(k,cell1)))
   pgrad = cqu(k,iEdge)*0.5*c2*(exner(k,cell1)+exner(k,cell2))*pgrad
   pgrad = pgrad + 0.5*zxu(k,iEdge)*gravity*(rho_pp(k,cell1)+rho_pp(k,cell2))
   ru_p(k,iEdge) = ru_p(k,iEdge) + dts*(tend_ru(k,iEdge) - (1.0_RKIND - specZoneMaskEdge(iEdge))*pgrad)
end do
I would suggest we avoid fusing loops for now in an attempt to keep the ported code as close to the original as possible. Taking this change specifically, it was around a decade ago that we specifically split the computation of `ruAvg` into a separate `k`-loop: see c0dae35. Unfortunately, the commit message for c0dae35 is rather terse, so I can only guess now that the reason was to improve vectorization.
Understood, I'll get that changed.
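To make the split-versus-fused distinction concrete, here is a simplified sketch. The update formulas are stand-ins (the real loop also computes `pgrad`), and all names ending in `_split` or `_fused` are hypothetical. It only shows that the two loop structures perform the same per-element arithmetic, so the choice between them is about code structure and vectorization, not results.

```fortran
program split_vs_fused_sketch
   implicit none
   integer, parameter :: nVertLevels = 8      ! hypothetical column height
   real, parameter :: dts = 0.5               ! hypothetical time step
   real :: tend(nVertLevels)
   real :: ru_p_split(nVertLevels), ruAvg_split(nVertLevels)
   real :: ru_p_fused(nVertLevels), ruAvg_fused(nVertLevels)
   integer :: k

   do k = 1, nVertLevels
      tend(k) = real(k)
   end do
   ru_p_split(:) = 1.0;  ruAvg_split(:) = 0.0
   ru_p_fused(:) = 1.0;  ruAvg_fused(:) = 0.0

   ! Split form (the structure the reviewer asks to keep):
   ! one k-loop updates ru_p, a second k-loop accumulates ruAvg.
   do k = 1, nVertLevels
      ru_p_split(k) = ru_p_split(k) + dts*tend(k)
   end do
   do k = 1, nVertLevels
      ruAvg_split(k) = ruAvg_split(k) + ru_p_split(k)
   end do

   ! Fused form: both updates in a single k-loop.
   do k = 1, nVertLevels
      ru_p_fused(k) = ru_p_fused(k) + dts*tend(k)
      ruAvg_fused(k) = ruAvg_fused(k) + ru_p_fused(k)
   end do

   ! Self-check: both forms give the same values for this input.
   if (any(ru_p_split /= ru_p_fused)) error stop 'ru_p differs'
   if (any(ruAvg_split /= ruAvg_fused)) error stop 'ruAvg differs'
   print *, 'split and fused forms agree'
end program split_vs_fused_sketch
```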
   end do
   wwAvg(nVertLevels+1,iCell) = 0.0
   rw_p(nVertLevels+1,iCell) = 0.0
end do
I would suggest we leave this initialization code in the loop further down (around line 2576).
I will do as asked, and keep the modifications to a minimum.
Though, it seems like bad form to have an if condition inside a loop that doesn't depend on variables in the loop - especially when the code block immediately above tests the same condition and has similar behavior.
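The concern about a loop-invariant test can be sketched as follows. The flag `first_step` and the arrays `a` and `b` are hypothetical placeholders, not names from the MPAS source; the example only contrasts testing the condition on every iteration with hoisting it outside the loop.

```fortran
program hoist_condition_sketch
   implicit none
   integer, parameter :: nLevels = 4, nCells = 3   ! hypothetical sizes
   logical, parameter :: first_step = .true.       ! hypothetical loop-invariant flag
   real :: a(nLevels+1, nCells), b(nLevels+1, nCells)
   integer :: iCell

   a(:,:) = 1.0
   b(:,:) = 1.0

   ! Condition tested inside the loop on every iteration,
   ! even though it does not depend on iCell.
   do iCell = 1, nCells
      if (first_step) then
         a(nLevels+1, iCell) = 0.0
      end if
   end do

   ! Same work with the condition hoisted out of the loop.
   if (first_step) then
      do iCell = 1, nCells
         b(nLevels+1, iCell) = 0.0
      end do
   end if

   ! Self-check: both forms produce the same array.
   if (any(a /= b)) error stop 'forms differ'
   print *, 'both forms agree'
end program hoist_condition_sketch
```

Both forms produce the same result; keeping the test inside the loop, as requested in the review, keeps the ported code closer to the original.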
Enables the GPU execution of the atm_advance_acoustic_step_work subroutine by adding OpenACC directives. In order to discount the time spent to transfer data between CPU and GPU within this routine, the new timer 'atm_advance_acoustic_step [ACC_data_xfer]' has been added to the log file.
Changes include:
- Preparing the routine for porting. Modifying whitespace to make regions clear, changing implicit loop assignments to be explicit, and fusing some loops.
- Adding OpenACC parallel and loop directives to the do-loops.
- Managing the invariant fields needed for this routine in mpas_atm_dynamics_{init,finalize} so they are available across timesteps.
- Managing the other fields needed in the routine with OpenACC directives and using default(present) to ensure data isn't missed. default(present) clauses cause a run-time error if data isn't present.
Force-pushed from 6e40864 to c304356.
This PR makes small code modifications and adds OpenACC directives so the `atm_advance_acoustic_step_work` routine can execute on GPU(s).
Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer: `atm_advance_acoustic_step [ACC_data_xfer]`.
Invariant fields used in this routine are also copied to the device within `mpas_atm_dynamics_init` and are deleted in `mpas_atm_dynamics_finalize`.
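The data-lifetime pattern described here (copy invariant fields once at initialization, reuse them every time step, delete them at finalization) might look roughly like the sketch below. The module, subroutine, and variable names are hypothetical stand-ins for the MPAS routines, and the arithmetic is a placeholder.

```fortran
module invariant_sketch
   implicit none
   real, allocatable :: invariant_field(:)   ! hypothetical invariant field
contains
   subroutine sketch_init(n)
      integer, intent(in) :: n
      allocate(invariant_field(n))
      invariant_field(:) = 2.0
      ! Copy the invariant field to the device once; it stays resident
      ! until the matching exit data directive.
      !$acc enter data copyin(invariant_field)
   end subroutine sketch_init

   subroutine sketch_step(state)
      real, intent(inout) :: state(:)
      integer :: i, n
      n = size(state)
      ! state is transferred on every call; invariant_field is expected
      ! to be present already because of default(present).
      !$acc parallel default(present) copy(state)
      !$acc loop gang vector
      do i = 1, n
         state(i) = state(i) * invariant_field(i)
      end do
      !$acc end parallel
   end subroutine sketch_step

   subroutine sketch_finalize()
      ! Remove the invariant field from the device, then free host memory.
      !$acc exit data delete(invariant_field)
      deallocate(invariant_field)
   end subroutine sketch_finalize
end module invariant_sketch

program invariant_driver
   use invariant_sketch
   implicit none
   real :: state(4)
   integer :: step

   state(:) = 1.0
   call sketch_init(size(state))
   do step = 1, 3
      call sketch_step(state)
   end do
   call sketch_finalize()

   ! Self-check: three steps multiply each element by 2**3.
   if (any(state /= 8.0)) error stop 'unexpected result'
   print *, 'state(1) =', state(1)
end program invariant_driver
```

In this sketch only `state` moves between host and device on each step, which is the kind of per-call transfer the `[ACC_data_xfer]` timer is meant to isolate.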