
Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on pm-gpu #6460

Merged -- 3 commits merged into master from ykim/machinefiles/fix-nvidiagpu-pm-gpu on Jun 18, 2024

Conversation

@grnydawn (Contributor) commented Jun 6, 2024

Update the CMAKE_EXE_LINKER_FLAGS CMake variable to use the Nvidia cc80 compute capability on Perlmutter, to match its Nvidia A100 GPUs.

@grnydawn requested a review from @ndkeen, Jun 6, 2024 13:59

@ndkeen (Contributor) commented Jun 6, 2024

Yes, that is clearly wrong. From git blame, it looks like @jgfouca made the change -- Jim, can you say what your thinking was here? And maybe check to see if other changes were made at the same time?

perlmutter-login10% git blame cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  1) string(APPEND CONFIG_ARGS " --host=cray")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  2) set(USE_CUDA "TRUE")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800  3) string(APPEND CPPDEFS " -DGPU -DMPAS_OPENACC")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  4) if (COMP_NAME STREQUAL gptl)
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  5)   string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_GETTIMEOFDAY")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  6) endif()
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  7) string(APPEND CPPDEFS " -DTHRUST_IGNORE_CUB_VERSION_CHECK")
dfe1aded789 cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (James Foucar      2023-10-25 14:15:47 -0600  8) string(APPEND CMAKE_CUDA_FLAGS " -ccbin CC -O2 -arch sm_80 --use_fast_math")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800  9) string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 10) set(SCC "cc")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 11) set(SCXX "CC")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 12) set(SFC "ftn")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800 13) string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
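
For reference: blame lines 9 and 13 carry the stale -gpu=cc70,cc60 targets (cc70 is V100, cc60 is P100), while the CUDA flags on line 8 already target sm_80. A minimal sketch of the fix, assuming the CMAKE_Fortran_FLAGS line gets the same treatment as CMAKE_EXE_LINKER_FLAGS:

# pm-gpu nodes carry NVIDIA A100s (compute capability 8.0), so target cc80
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")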

@xylar (Contributor) commented Jun 6, 2024

@grnydawn, thanks for making this PR!

@jgfouca (Member) commented Jun 10, 2024

@ndkeen, I think it's fine. Git blame is showing the commit I made that changed all the macros to the newer style (using CMake names for things instead of custom CIME names).

@xylar (Contributor) commented Jun 10, 2024

Not that we need to cast blame, but I think this was introduced by @amametjanov in 8eb8af2.

@ndkeen (Contributor) commented Jun 10, 2024

I understand -- but looking at the PR that made the change might be good, to see what other changes were made. I'm testing some things now -- why do we have the -acc flag? We don't have that flag for the GNU builds (i.e., gnugpu)?

@xylar (Contributor) commented Jun 10, 2024

This was the PR:
#6103

@ndkeen (Contributor) commented Jun 10, 2024

I don't think we want to build with OpenACC by default. For example, the land model has OpenACC pragmas, but they are not often tested. If MPAS needs OpenACC, maybe there is a way to compile it conditionally.

@xylar (Contributor) commented Jun 10, 2024

I don't think MPAS-Ocean has support for GPUs other than OpenACC, so it should be compiled with OpenACC on GPU machines.

@xylar (Contributor) commented Jun 10, 2024

@philipwjones would be the better person to address this, though.

@philipwjones (Contributor) commented:

Yes, MPAS-Ocean uses OpenACC and only OpenACC for GPU, so the -acc flag is required. The only thing this PR is doing is upgrading the target-hardware flags to take advantage of any hardware-specific compiler optimization that might be available.

@ndkeen (Contributor) commented Jun 11, 2024

But we do not want all Fortran files built with ACC. So I suspect you may need to find a way to issue the -acc flags only when building the source files that need them (MPAS?).

For example, this test will fail to build with ACC, but works when I turn it off: SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-gpu_gnugpu
I actually needed to add -noacc for this test, which may be what's required for nvidia -- i.e., ACC may be on by default.
The reason this test fails to build with ACC is that there are ACC pragmas in the land source code that have trouble building.

I think currently, GNU builds are not using ACC. What test are you trying?

@xylar (Contributor) commented Jun 11, 2024

@ndkeen, that may be an important issue to address, but I think we're arguing that it's beyond the scope of this particular PR. In Omega development, we can't currently use nvidiagpu on pm-gpu without this fix, but we aren't particularly interested in fixing the OpenACC issue, which we did not introduce and do not need fixed for our purposes. Perhaps a different issue or PR is needed to cover the issues you're bringing up?

@ndkeen (Contributor) commented Jun 11, 2024

It might be more fundamental. What test are you trying? It looks like a test that might work with nvidiagpu would not work with gnugpu, as the ACC flags are not on by default there?

I need to add -noacc to the nvidia-compiled test noted above to allow it to work (as it needs to ignore the land ACC pragmas), so I could make a PR where I instead turn OFF ACC by default. Then you could find a way to turn it back on for the files you need.

@xylar (Contributor) commented Jun 11, 2024

@ndkeen, I think we could work with that. We're making the change here on the Omega develop branch for now, but we can revert it before we merge in whatever changes you need to make to E3SM master. We just wanted to make sure this particular issue got fixed on the E3SM side, rather than getting merged in from Omega.

@grnydawn (Contributor, Author) commented Jun 11, 2024

@ndkeen, @xylar: Omega does not use the "-acc" flag. I think the "-acc" flag has not caused any issues with the NVHPC compiler for the Omega build because the NVHPC compiler simply ignores it if the source code does not include any OpenACC directives.

However, the "-gpu=cc70,cc60" flag in CMAKE_EXE_LINKER_FLAGS caused an issue when CMake tried to detect the type of compiler, and a simple test code could not be compiled.
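
A minimal sketch of that failure mode as a hypothetical standalone reproduction (not the actual E3SM build): flags set before project() are applied when CMake compiles and links its trivial compiler-detection test, so a rejected -gpu= value can make configuration itself fail.

# hypothetical reproduction; CMake's compiler/ABI detection links a small
# test binary with CMAKE_EXE_LINKER_FLAGS applied, so a flag the toolchain
# rejects surfaces as a compiler-detection failure
cmake_minimum_required(VERSION 3.16)
set(CMAKE_EXE_LINKER_FLAGS "-acc -gpu=cc70,cc60 -Minfo=accel")
project(probe LANGUAGES Fortran)  # fails here if the linker flags are rejected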

@ndkeen (Contributor) commented Jun 11, 2024

What I'm saying is that the PR to add those lines should not have been done at all -- we don't want ACC on by default. Unless we do. And, on top of that, for nvidia (and maybe the Cray compiler), we have to turn off ACC explicitly.

What test are you trying?

@grnydawn (Contributor, Author) commented:

@ndkeen, I generally agree that component-specific flags, such as the "-acc" flag, are better limited to the specific component. I also agree with @xylar that this "-acc" issue might need another PR.

If you are asking about MPAS-Ocean testing, I think I am not the right person to answer because I am not familiar with the tests for MPAS-Ocean.

If you are asking about Omega testing, there are currently several unit tests for each algorithm being developed. During the Omega build, CIME configurations from an E3SM case, including compiler and CMake variables, are collected.

@philipwjones (Contributor) commented Jun 11, 2024

Well... this is an NVIDIA GPU configuration, and OpenACC was the means by which MPAS-Ocean, land, and even parts of the atmosphere at one point were accessing the GPU. So it made sense to have it on by default, and it doesn't impact non-ACC code. The fact that some ACC code isn't working is a bug that needs fixing. A large fraction of MPAS code is ACC-enabled, so if you gate it conditionally, it should be done at the component level for MPAS-Ocean and not on a per-file basis (see the sketch below).
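
A minimal sketch of that component-level gating; the component name CIME passes for MPAS-Ocean is assumed here to be mpaso, for illustration only:

# hypothetical component-level opt-in; the COMP_NAME value is an assumption
if (COMP_NAME STREQUAL mpaso)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()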

@ndkeen (Contributor) commented Jun 11, 2024

Currently, GNU-built code will not use ACC, but nvidia-built code will (i.e., gnugpu vs nvidiagpu), as those are the compiler defaults. Is that what we want? I would think what we want is for ACC to be disabled unless it is requested. We could do it for all code in a similar way as we do OpenMP? (A sketch of what that could look like follows.)
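
A minimal sketch of an explicit opt-in switch, by analogy with how OpenMP is toggled; the USE_OPENACC variable name is an assumption, not an existing E3SM/CIME setting:

# hypothetical opt-in toggle; USE_OPENACC is an assumed name for illustration
if (USE_OPENACC)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
else()
  # nvhpc can enable ACC by default, so disable it explicitly when not requested
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
endif()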

@philipwjones (Contributor) commented:

To the extent the compilers support it (GNU, Cray, and nvidia all have some support, with nvidia the most stable), and to the extent it's been tested and shown to have some benefit (not always clear), I think we should use it. Otherwise, why did we put all that work into it? My opinion only...

@ndkeen (Contributor) commented Jun 11, 2024

We could consider something like this in cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake:

if (COMP_NAME STREQUAL elm)
  # land (ELM) sources have ACC pragmas that currently fail to build, so disable ACC
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
  string(APPEND CMAKE_C_FLAGS " -noacc")
  string(APPEND CMAKE_CXX_FLAGS " -noacc")
else()
  # everything else gets ACC, targeting the A100 (cc80)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()

so that source in LAND will not be built with ACC. It will add those flags to all Fortran sources (except in LAND) and will hopefully have no impact where there are no pragmas. We will need to verify there are no ill effects from this -- but currently I think the only use of the nvidia compilers is for testing.

Unfortunately, with the above change, the following test builds but hits a "Bus error":
SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-gpu_nvidiagpu

But then, should we not also enable ACC for GNU builds?

@rljacob (Member) commented Jun 12, 2024

Ok, I caught up with this. The "gpu" in a compiler name means "turn on all the GPU things". So yes, add -acc to gnugpu.

We have a "gpuacc" CIME-based test suite, which currently runs only on nvidia/Perlmutter. It has 2 tests:
SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu - failing on build
SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu - passing

The difference between those two is that the first one has a full ocean model. So something is probably broken in the MPAS-Ocean OpenACC calls.

@grnydawn, you have to make sure the test that is currently passing continues to pass with these changes.

@rljacob (Member) commented Jun 12, 2024

@ndkeen the land model failing with -acc has nothing to do with this PR. Open a separate issue for that.

@grnydawn (Contributor, Author) commented:

@rljacob, Omega testing does not have an issue with or without the '-acc' flag, so I think the change might not affect Omega. For MPAS-Ocean, I will run the 'gpuacc' test suite once the change is made and see if there are any issues.

@xylar (Contributor) commented Jun 12, 2024

@grnydawn, my understanding of @rljacob's comment is that you need to run SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu and make sure it still passes. @rljacob, is that right?

@ndkeen (Contributor) commented Jun 12, 2024

After some discussion, we are mostly in agreement on a "rule" that is roughly: any compiler with gpu in the name (*) should turn on all GPU-related flags -- including ACC. That means that nvidia (which has ACC on by default) is already OK, but for other compilers, such as GNU and Cray, we may need to make changes to enable ACC for GPU builds (a sketch for gnugpu follows below).

To address the issue of the LAND sources not all building with ACC, we could try what I suggested above for now (or something better). I'm also chatting with @peterdschwartz, who may have ideas.

(*) I just want to note my protest here against this original naming scheme -- the compiler name should be the compiler name, and then modifiers should indicate what else is happening (such as threading, ACC, etc. -- or even whether the build is meant for GPU or not).
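
Under that rule, a minimal sketch of what enabling ACC in gnugpu_pm-gpu.cmake could look like; -fopenacc is GNU's OpenACC switch, but whether the Perlmutter GNU toolchain supports GPU offload this way is an assumption:

# hypothetical, untested change for gnugpu_pm-gpu.cmake
string(APPEND CMAKE_Fortran_FLAGS " -fopenacc")
string(APPEND CMAKE_EXE_LINKER_FLAGS " -fopenacc")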

@rljacob (Member) commented Jun 12, 2024

@xylar, yes, you're right. @grnydawn, run that test on Perlmutter on this branch.

@rljacob added the "Machine Files" and "pm-gpu (Perlmutter machine at NERSC, GPU nodes)" labels, Jun 12, 2024
@grnydawn (Contributor, Author) commented:

@rljacob, I ran the "SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu" test on Perlmutter using the branch 'https://github.com/E3SM-Project/E3SM/tree/ykim/machinefiles/fix-nvidiagpu-pm-gpu', and got the following error. It might not be related to this "-acc" issue; I will look into what caused it.

create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /pscratch/sd/y/youngsun/e3sm_scratch/pm-gpu/PullRequest6460/tests/SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu.20240612_131302_uwtt6j
RUNNING TESTS:
  SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu
Starting CREATE_NEWCASE for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished CREATE_NEWCASE for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 1.163727 seconds (PASS)
Starting XML for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished XML for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 0.381744 seconds (PASS)
Starting SETUP for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished SETUP for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 1.705043 seconds (FAIL). [COMPLETED 1 of 1]
    Case dir: /pscratch/sd/y/youngsun/e3sm_scratch/pm-gpu/PullRequest6460/tests/SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu.20240612_131302_uwtt6j
    Errors were:
        ERROR: No variable SMP_PRESENT found in case

@jgfouca (Member) commented Jun 12, 2024

Something old got into your code.

ERROR: No variable SMP_PRESENT found in case

SMP_PRESENT was replaced with BUILD_THREADED quite a while ago.

@rljacob (Member) commented Jun 12, 2024

@grnydawn did you update your submodules?

@grnydawn (Contributor, Author) commented:

@rljacob, @jgfouca: that was my mistake. After running git submodule update --init --recursive, the case built successfully and the test job is in the Perlmutter queue. Thanks!

@ndkeen (Contributor) commented Jun 13, 2024

Noting that we still have the same build failure as noted in #6470,
which is with SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.

That means the only test I can use to verify the changes are OK is:
SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu
but that test was already passing before the change in this PR.

And with or without this change, I can't seem to build the other GPU tests that currently run with gnugpu:

SMS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-gpu_gnugpu
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu (which currently also fails to build, for a different issue)

I suspect the reason nobody can name a relevant test here in this issue is that you are developing with stand-alone MPAS or Omega. So there is not a way to verify that these changes are OK.

@ndkeen (Contributor) commented Jun 13, 2024

I'm trying the GPU tests we have with:

if (COMP_NAME STREQUAL elm)
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
  string(APPEND CMAKE_C_FLAGS " -noacc")
  string(APPEND CMAKE_CXX_FLAGS " -noacc")
else()
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()

Regarding the SCREAM test, it seems to have more general problems with the nvidia compiler and hits a segfault even on CPU:
SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-cpu_nvidia (which is a variant of a test in e3sm_gpucxx).
So we can ignore that failing with nvidiagpu.

With the MMF test SMS_D_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-gpu_nvidiagpu, there are build issues with YAKL:

"/global/cfs/cdirs/e3sm/ndk/repos/pr/ykim_machinefiles_fix-nvidiagpu-pm-gpu/externals/YAKL/src/YAKL_streams_events.h", line 452: error: identifier "NV_IS_HOST" is undefined
        YAKL_EXECUTE_ON_HOST_ONLY( list = new std::vector<Stream>; )
        ^

I'm thinking it's maybe not worth trying to verify that we can build/run a test that uses LAND (with ACC) right now. We can just make the change here -- which again means that ACC is only supported with the nvidiagpu compiler, and only for tests that do not use LAND.

@ndkeen (Contributor) commented Jun 13, 2024

I made some minor edits and committed to the branch -- if it looks OK, we can merge this.

@ndkeen changed the title from "Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on Perlmutter" to "Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on pm-gpu", Jun 13, 2024
@ndkeen added the "nvidia compiler" (formerly PGI) label, Jun 13, 2024
@ndkeen added a commit that referenced this pull request, Jun 13, 2024:

…6460)

For pm-gpu: update the CMAKE_EXE_LINKER_FLAGS CMake variable to use the Nvidia cc80 compute capability,
to match the Nvidia A100 GPUs.
Should only affect ACC.
@ndkeen (Contributor) commented Jun 13, 2024

merged to next

@ndkeen merged commit 3f489e8 into master, Jun 18, 2024 -- 21 checks passed.
@ndkeen deleted the ykim/machinefiles/fix-nvidiagpu-pm-gpu branch, June 18, 2024 18:03.