Missing layerthicknessedge decl with pgigpu #4865

Closed
amametjanov opened this issue Mar 31, 2022 · 8 comments · Fixed by #4866

Comments

@amametjanov
Member

With the latest next branch, nvfortran is raising a build error on Summit and Ascent for

./cime/scripts/create_test SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF --compiler pgigpu

which appears to stem from the conversion of layerThicknessEdge to layerThickEdgeFlux in #4832.

NVFORTRAN-S-0038-Symbol, layerthicknessedge, has not been explicitly declared (
/gpfs/alpine/cli115/proj-shared/azamat/e3sm_scratch/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.summit_pgigpu.20220331-hadv-openacc/bld/cmake-bld/core_ocean/shared/mpas_ocn_thick_hadv.f90: 140)

140       !$acc parallel loop &
141       !$acc     present(tend, normalVelocity, dvEdge, layerThicknessEdge, edgeSignOnCell, &
142       !$acc     minLevelEdgeBot, maxLevelEdgeBot) &
143       !$acc     private(i, k, invAreaCell, flux)
144       do iCell = 1, nCellsOwned
145         invAreaCell = 1.0_RKIND / areaCell(iCell)
146         do i = 1, nEdgesOnCell(iCell)
147           iEdge = edgesOnCell(i, iCell)
148           do k = minLevelEdgeBot(iEdge), maxLevelEdgeTop(iEdge)
149             flux = normalVelocity(k, iEdge) * dvEdge(iEdge) * layerThickEdgeFlux(k, iEdge)
150             tend(k, iCell) = tend(k, iCell) + edgeSignOnCell(i, iCell) * flux * invAreaCell
151           end do
152         end do
153       end do
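Note the mismatch: the present clause on line 141 still names the old layerThicknessEdge, while the loop body on line 149 reads the new layerThickEdgeFlux (and maxLevelEdgeTop, used on line 148, is absent from the clause as well). Presumably the directive just needs the renamed variables - a sketch of the expected fix, not the actual #4866 patch:

      !$acc parallel loop &
      !$acc     present(tend, normalVelocity, dvEdge, layerThickEdgeFlux, edgeSignOnCell, &
      !$acc     minLevelEdgeBot, maxLevelEdgeTop) &
      !$acc     private(i, k, invAreaCell, flux)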

FYI @brian-oneill

@philipwjones
Contributor

Thanks @amametjanov - I stumbled on this in another PR review today too. Looks like the change wasn't propagated into all the acc directives. @jonbob I can put together a quick fix for this today.

@philipwjones
Contributor

Well... I thought it was just a few missing transfer directives, but something else is missing, so it's not such a quick fix. Digging a little more.

@philipwjones
Contributor

philipwjones commented Apr 1, 2022

When the new layerThickEdgeFlux quantity is calculated, it was being set to the Mean variable with array syntax and no OpenACC directive. So with OpenACC enabled, the Mean variable is computed on the device, while the Flux variable is set equal to the Mean value on the host, which hasn't been updated from the device. PR to fix coming...
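For illustration, a minimal sketch of that pattern, with simplified loop bounds and assumed helper arrays (layerThickness, cellsOnEdge) - not the exact MPAS code:

      ! The Mean variable is computed inside an OpenACC region, so its
      ! up-to-date values live on the device:
      !$acc parallel loop collapse(2) &
      !$acc    present(layerThickEdgeMean, layerThickness, cellsOnEdge)
      do iEdge = 1, nEdges
         do k = 1, nVertLevels
            layerThickEdgeMean(k, iEdge) = 0.5_RKIND * &
               (layerThickness(k, cellsOnEdge(1, iEdge)) + &
                layerThickness(k, cellsOnEdge(2, iEdge)))
         end do
      end do

      ! Buggy: plain array syntax executes on the host and copies the
      ! stale host image of layerThickEdgeMean into layerThickEdgeFlux:
      layerThickEdgeFlux(:, :) = layerThickEdgeMean(:, :)

      ! Fixed: perform the copy on the device as well:
      !$acc parallel loop collapse(2) &
      !$acc    present(layerThickEdgeFlux, layerThickEdgeMean)
      do iEdge = 1, nEdges
         do k = 1, nVertLevels
            layerThickEdgeFlux(k, iEdge) = layerThickEdgeMean(k, iEdge)
         end do
      end do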

@amametjanov
Member Author

amametjanov commented Apr 2, 2022

Thanks! The PR fixed the build error.
I also needed to add the 2 mods below to get past run-time errors for T62_oEC60to30v3 + CMPASO-NYF in E3SM (landIcePressureOn == .false.).

1: 2: FATAL ERROR: data in PRESENT clause was not found on device 1: name=landicepressure host:0x6bdae820
1: 2:  file:bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics.f90 ocn_diagnostic_solve_surface_pressure line:1507

1507       !$acc parallel loop present(surfacePressure, atmosphericPressure, &
1508       !$acc                       seaIcePressure, frazilSurfacePressure, &
1509       !$acc                       landIcePressure)
1510       do iCell = 1, nCells
1511          ! Pressure for generalized coordinates.
1512          ! Pressure at top surface may be due to atmospheric pressure, frazil
1513          ! ice, sea ice or the weight of an ice shelf
1514          surfacePressure(iCell) = atmosphericPressure(iCell) + seaIcePressure(iCell)
1515          if ( associated(frazilSurfacePressure) ) &
1516             surfacePressure(iCell) = surfacePressure(iCell) + frazilSurfacePressure(iCell)
1517          if ( associated(landIcePressure) ) &
1518             surfacePressure(iCell) = surfacePressure(iCell) + landIcePressure(iCell)
1519       end do

Summit's nvhpc/21.3 eagerly checks present clauses and throws errors for variables that are not used in this configuration. Can you add the two mods to PR #4866? Together with the latest Scorpio update, this lets the case run to completion.

diff --git a/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F b/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
index 1ea0ddee56..2de60dcb4b 100644
--- a/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
+++ b/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
@@ -1729,8 +1729,7 @@ contains

 #ifdef MPAS_OPENACC
       !$acc parallel loop present(surfacePressure, atmosphericPressure, &
-      !$acc                       seaIcePressure, frazilSurfacePressure, &
-      !$acc                       landIcePressure)
+      !$acc                       seaIcePressure, frazilSurfacePressure)
 #else 
       !$omp parallel
       !$omp do schedule(runtime)
@@ -2040,7 +2039,7 @@ contains

       if (landIcePressureOn) then
 #ifdef MPAS_OPENACC
-         !$acc parallel loop present(pressureAdjustedSSH, landIceDraft)
+         !$acc parallel loop present(pressureAdjustedSSH)
 #else
          !$omp parallel
          !$omp do schedule(runtime)

@philipwjones
Contributor

Thanks @amametjanov, pushing a change now...

@philipwjones
Contributor

@amametjanov - the iceDraft case should be correct, so I'm guessing nvhpc might be having trouble because it's an optional argument? I'll go ahead and remove it from the directive for now. I have a better fix for both, but it involves a more extensive change that will need its own PR.
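A minimal sketch of the suspected failure mode, in simplified hypothetical code rather than the actual MPAS source - the runtime's present() lookup fires at kernel launch, before any associated() guard in the loop body can run:

      ! landIcePressure is only allocated/attached to the device when
      ! landIcePressureOn == .true.; when it is .false., nvhpc finds no
      ! device copy for the present() entry and aborts, even though the
      ! loop body never dereferences the array:
      !$acc parallel loop present(surfacePressure, landIcePressure)
      do iCell = 1, nCells
         if ( associated(landIcePressure) ) &
            surfacePressure(iCell) = surfacePressure(iCell) + landIcePressure(iCell)
      end do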

@amametjanov
Member Author

With the latest branch update, I still had to remove landIcePressure from the present clause to avoid that error and to get the case to run to completion:

1: 12: FATAL ERROR: data in PRESENT clause was not found on device 1: name=landicepressure host:0x494a43e0
1: 12:  file:
/gpfs/alpine/cli115/proj-shared/azamat/e3sm_scratch/pgigpu/20220331/SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.summit_pgigpu.0402/bld/cmake-bld/core_ocean/shared/mpas_ocn_diagnostics.f90 
ocn_diagnostic_solve_surface_pressure line:1532

diff --git a/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F b/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
index 8b5d1efc69..f8b8c3bc89 100644
--- a/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
+++ b/components/mpas-ocean/src/shared/mpas_ocn_diagnostics.F
@@ -1769,7 +1769,7 @@ contains
       if (landIcePressureOn) then
 #ifdef MPAS_OPENACC
          !$acc parallel loop &
-         !$acc    present(surfacePressure, landIcePressure)
+         !$acc    present(surfacePressure)
 #else
          !$omp do schedule(runtime)
 #endif

$ git checkout next
$ git merge --no-ff pjones/philipwjones/mpas-ocean/acc-thickfix
$ ./cime/scripts/create_test SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF SMS_Ld1_D.T62_oEC60to30v3.CMPASO-NYF --compiler pgigpu --test-id 0402

@philipwjones
Contributor

Think I know what the issue is and will work on a more permanent fix soon, but for now I went ahead and removed it from the clause and pushed the change.

jonbob added a commit that referenced this issue Apr 12, 2022
Fixes OpenACC bugs introduced by recent layerThickEdge changes

During a recent addition of the layerThickEdgeFlux/Mean variables, some
GPU errors were introduced:
* layerThickEdgeXXX vars were missing from some OpenACC update
  directives
* the use of array syntax when layerThickEdgeFlux is set to
  layerThickEdgeMean was causing the operation to happen on the
  host (with bad data) rather than the device when OpenACC is turned on
* replaced complicated string conditionals with a case construct to
  improve performance and clarity
* the conditionals above were moved to diagnostics init
* cleaned up the thickness edge routine a little
* fixed related OpenACC present-data issues in the surface pressure
  calculation by moving option conditionals outside the loop and
  splitting the loop accordingly
Tested on Summit w/ PGI in QU240; bfb when OpenACC is turned on

Fixes #4865
[BFB]
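A minimal sketch of the loop-splitting fix described in the last bullet of the commit message, reusing names from the surface-pressure loop quoted earlier (illustrative, not the actual commit contents):

      ! Unconditional contributions: no optional arrays in present()
      !$acc parallel loop present(surfacePressure, atmosphericPressure, &
      !$acc                       seaIcePressure)
      do iCell = 1, nCells
         surfacePressure(iCell) = atmosphericPressure(iCell) + seaIcePressure(iCell)
      end do

      ! Each optional contribution gets its own guarded loop, so the
      ! present() clause only ever names arrays that exist on the device:
      if (associated(frazilSurfacePressure)) then
         !$acc parallel loop present(surfacePressure, frazilSurfacePressure)
         do iCell = 1, nCells
            surfacePressure(iCell) = surfacePressure(iCell) + frazilSurfacePressure(iCell)
         end do
      end if

      if (associated(landIcePressure)) then
         !$acc parallel loop present(surfacePressure, landIcePressure)
         do iCell = 1, nCells
            surfacePressure(iCell) = surfacePressure(iCell) + landIcePressure(iCell)
         end do
      end if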
@jonbob closed this as completed in 77c2ded Apr 13, 2022