
Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on pm-gpu #6460

Merged -- 3 commits merged into master from ykim/machinefiles/fix-nvidiagpu-pm-gpu on Jun 18, 2024

Conversation

@grnydawn (Contributor) commented Jun 6, 2024

Update the CMAKE_EXE_LINKER_FLAGS CMake variable to use the Nvidia cc80 compute capability on Perlmutter, to match its Nvidia A100 GPUs.

@grnydawn requested a review from @ndkeen, Jun 6, 2024 13:59

@ndkeen (Contributor) commented Jun 6, 2024

Yes, that is clearly wrong. From git blame, it looks like @jgfouca made the change -- Jim, can you say what your thinking was here? And maybe check to see if other changes were made at the same time?

perlmutter-login10% git blame cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  1) string(APPEND CONFIG_ARGS " --host=cray")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  2) set(USE_CUDA "TRUE")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800  3) string(APPEND CPPDEFS " -DGPU -DMPAS_OPENACC")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  4) if (COMP_NAME STREQUAL gptl)
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  5)   string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_GETTIMEOFDAY")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  6) endif()
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800  7) string(APPEND CPPDEFS " -DTHRUST_IGNORE_CUB_VERSION_CHECK")
dfe1aded789 cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (James Foucar      2023-10-25 14:15:47 -0600  8) string(APPEND CMAKE_CUDA_FLAGS " -ccbin CC -O2 -arch sm_80 --use_fast_math")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800  9) string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 10) set(SCC "cc")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 11) set(SCXX "CC")
e63dfd2b172 cime_config/machines/cmake_macros/nvidiagpu_perlmutter.cmake (noel              2021-11-15 12:24:54 -0800 12) set(SFC "ftn")
8eb8af2c4aa cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake     (Azamat Mametjanov 2023-12-08 15:53:37 -0800 13) string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc70,cc60 -Minfo=accel")
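
For reference: blame lines 9 and 13 carry the stale -gpu=cc70,cc60 targets (cc70 is V100, cc60 is P100), while the CUDA flags on line 8 already target sm_80. A minimal sketch of the fix, assuming the CMAKE_Fortran_FLAGS line gets the same treatment as CMAKE_EXE_LINKER_FLAGS:

# pm-gpu nodes carry NVIDIA A100s (compute capability 8.0), so target cc80
string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")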

@xylar (Contributor) commented Jun 6, 2024

@grnydawn, thanks for making this PR!

@jgfouca (Member) commented Jun 10, 2024

@ndkeen, I think it's fine. Git blame is showing the commit I made that changed all the macros to the newer style (using CMake names for things instead of custom CIME names).

@xylar (Contributor) commented Jun 10, 2024

Not that we need to cast blame, but I think this was introduced by @amametjanov in 8eb8af2.

@ndkeen (Contributor) commented Jun 10, 2024

I understand -- but looking at the PR that made the change might be good, to see what other changes were made. I'm testing some things now -- why do we have the -acc flag? We don't have that flag for the GNU builds (i.e., gnugpu)?

@xylar (Contributor) commented Jun 10, 2024

This was the PR:
#6103

@ndkeen (Contributor) commented Jun 10, 2024

I don't think we want to build with OpenACC by default. For example, the land model has OpenACC pragmas, but they are not often tested. If MPAS needs OpenACC, maybe there is a way to compile it conditionally.

@xylar (Contributor) commented Jun 10, 2024

I don't think MPAS-Ocean has support for GPUs other than OpenACC, so it should be compiled with OpenACC on GPU machines.

@xylar (Contributor) commented Jun 10, 2024

@philipwjones would be the better person to address this, though.

@philipwjones (Contributor) commented:

Yes, MPAS-Ocean uses OpenACC and only OpenACC for GPU, so the -acc flag is required. The only thing this PR is doing is upgrading the target-hardware flags to take advantage of any hardware-specific compiler optimization that might be available.

@ndkeen (Contributor) commented Jun 11, 2024

But we do not want all Fortran files built with ACC. So I suspect you may need to find a way to issue the -acc flags only when building the source files that need them (MPAS?).

For example, this test will fail to build with ACC, but works when I turn it off: SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-gpu_gnugpu
I actually needed to add -noacc for this test, which may be what's required for nvidia -- i.e., ACC may be on by default.
The reason this test fails to build with ACC is that there are ACC pragmas in the land source code that have trouble building.

I think currently, GNU builds are not using ACC. What test are you trying?

@xylar (Contributor) commented Jun 11, 2024

@ndkeen, that may be an important issue to address, but I think we're arguing that it's beyond the scope of this particular PR. In Omega development, we can't currently use nvidiagpu on pm-gpu without this fix, but we aren't particularly interested in fixing the OpenACC issue, which we did not introduce and do not need fixed for our purposes. Perhaps a different issue or PR is needed to cover the issues you're bringing up?

@ndkeen (Contributor) commented Jun 11, 2024

It might be more fundamental. What test are you trying? It looks like a test that might work with nvidiagpu would not work with gnugpu, as the ACC flags are not on by default there?

I need to add -noacc to the nvidia-compiled test noted above to allow it to work (as it needs to ignore the land ACC pragmas), so I could make a PR where I instead turn OFF ACC by default. Then you could find a way to turn it back on for the files you need.

@xylar (Contributor) commented Jun 11, 2024

@ndkeen, I think we could work with that. We're making the change here on the Omega develop branch for now, but we can revert it before we merge in whatever changes you need to make to E3SM master. We just wanted to make sure this particular issue got fixed on the E3SM side, rather than getting merged in from Omega.

@grnydawn (Contributor, Author) commented Jun 11, 2024

@ndkeen, @xylar: Omega does not use the "-acc" flag. I think the "-acc" flag has not caused any issues with the NVHPC compiler for the Omega build because the NVHPC compiler simply ignores it if the source code does not include any OpenACC directives.

However, the "-gpu=cc70,cc60" flag in CMAKE_EXE_LINKER_FLAGS caused an issue when CMake tried to detect the type of compiler, and a simple test code could not be compiled.
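
A minimal sketch of that failure mode as a hypothetical standalone reproduction (not the actual E3SM build): flags set before project() are applied when CMake compiles and links its trivial compiler-detection test, so a rejected -gpu= value can make configuration itself fail.

# hypothetical reproduction; CMake's compiler/ABI detection links a small
# test binary with CMAKE_EXE_LINKER_FLAGS applied, so a flag the toolchain
# rejects surfaces as a compiler-detection failure
cmake_minimum_required(VERSION 3.16)
set(CMAKE_EXE_LINKER_FLAGS "-acc -gpu=cc70,cc60 -Minfo=accel")
project(probe LANGUAGES Fortran)  # fails here if the linker flags are rejected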

@ndkeen (Contributor) commented Jun 11, 2024

What I'm saying is that the PR to add those lines should not have been done at all -- we don't want ACC on by default. Unless we do. And, on top of that, for nvidia (and maybe the Cray compiler), we have to turn off ACC explicitly.

What test are you trying?

@grnydawn (Contributor, Author) commented:

@ndkeen, I generally agree that component-specific flags, such as the "-acc" flag, are better limited to the specific component. I also agree with @xylar that this "-acc" issue might need another PR.

If you are asking about MPAS-Ocean testing, I think I am not the right person to answer because I am not familiar with the tests for MPAS-Ocean.

If you are asking about Omega testing, there are currently several unit tests for each algorithm being developed. During the Omega build, CIME configurations from an E3SM case, including compiler and CMake variables, are collected.

@philipwjones (Contributor) commented Jun 11, 2024

Well... this is an NVIDIA GPU configuration, and OpenACC was the means by which MPAS-Ocean, land, and even parts of the atmosphere at one point were accessing the GPU. So it made sense to have it on by default, and it doesn't impact non-ACC code. The fact that some ACC code isn't working is a bug that needs fixing. A large fraction of MPAS code is ACC-enabled, so if you gate it conditionally, it should be done at the component level for MPAS-Ocean and not on a per-file basis (see the sketch below).
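
A minimal sketch of that component-level gating; the component name CIME passes for MPAS-Ocean is assumed here to be mpaso, for illustration only:

# hypothetical component-level opt-in; the COMP_NAME value is an assumption
if (COMP_NAME STREQUAL mpaso)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()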

@ndkeen (Contributor) commented Jun 11, 2024

Currently, GNU-built code will not use ACC, but nvidia-built code will (i.e., gnugpu vs nvidiagpu), as those are the compiler defaults. Is that what we want? I would think what we want is for ACC to be disabled unless it is requested. We could do it for all code in a similar way as we do OpenMP? (A sketch of what that could look like follows.)
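
A minimal sketch of an explicit opt-in switch, by analogy with how OpenMP is toggled; the USE_OPENACC variable name is an assumption, not an existing E3SM/CIME setting:

# hypothetical opt-in toggle; USE_OPENACC is an assumed name for illustration
if (USE_OPENACC)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
else()
  # nvhpc can enable ACC by default, so disable it explicitly when not requested
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
endif()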

@philipwjones (Contributor) commented:

To the extent the compilers support it (GNU, Cray, and nvidia all have some support, with nvidia the most stable), and to the extent it's been tested and shown to have some benefit (not always clear), I think we should use it. Otherwise, why did we put all that work into it? My opinion only...

@ndkeen (Contributor) commented Jun 11, 2024

We could consider something like this in cime_config/machines/cmake_macros/nvidiagpu_pm-gpu.cmake:

if (COMP_NAME STREQUAL elm)
  # land (ELM) sources have ACC pragmas that currently fail to build, so disable ACC
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
  string(APPEND CMAKE_C_FLAGS " -noacc")
  string(APPEND CMAKE_CXX_FLAGS " -noacc")
else()
  # everything else gets ACC, targeting the A100 (cc80)
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()

so that source in LAND will not be built with ACC. It will add those flags to all Fortran sources (except in LAND) and will hopefully have no impact where there are no pragmas. We will need to verify there are no ill effects from this -- but currently I think the only use of the nvidia compilers is for testing.

Unfortunately, with the above change, the following test builds but hits a "Bus error":
SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-gpu_nvidiagpu

But then, should we not also enable ACC for GNU builds?

@rljacob (Member) commented Jun 12, 2024

Ok, I caught up with this. The "gpu" in a compiler name means "turn on all the GPU things". So yes, add -acc to gnugpu.

We have a "gpuacc" CIME-based test suite, which currently runs only on nvidia/Perlmutter. It has 2 tests:
SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu - failing on build
SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu - passing

The difference between those two is that the first one has a full ocean model. So something is probably broken in the MPAS-Ocean OpenACC calls.

@grnydawn, you have to make sure the test that is currently passing continues to pass with these changes.

@rljacob (Member) commented Jun 12, 2024

@ndkeen the land model failing with -acc has nothing to do with this PR. Open a separate issue for that.

@grnydawn (Contributor, Author) commented:

@rljacob, Omega testing does not have an issue with or without the '-acc' flag, so I think the change might not affect Omega. For MPAS-Ocean, I will run the 'gpuacc' test suite once the change is made and see if there are any issues.

@xylar (Contributor) commented Jun 12, 2024

@grnydawn, my understanding of @rljacob's comment is that you need to run SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu and make sure it still passes. @rljacob, is that right?

@ndkeen (Contributor) commented Jun 12, 2024

After some discussion, we are mostly in agreement on a "rule" that is roughly: any compiler with gpu in the name (*) should turn on all GPU-related flags -- including ACC. That means that nvidia (which has ACC on by default) is already OK, but for other compilers, such as GNU and Cray, we may need to make changes to enable ACC for GPU builds (a sketch for gnugpu follows below).

To address the issue of the LAND sources not all building with ACC, we could try what I suggested above for now (or something better). I'm also chatting with @peterdschwartz, who may have ideas.

(*) I just want to note my protest here against this original naming scheme -- the compiler name should be the compiler name, and then modifiers should indicate what else is happening (such as threading, ACC, etc. -- or even whether the build is meant for GPU or not).
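
Under that rule, a minimal sketch of what enabling ACC in gnugpu_pm-gpu.cmake could look like; -fopenacc is GNU's OpenACC switch, but whether the Perlmutter GNU toolchain supports GPU offload this way is an assumption:

# hypothetical, untested change for gnugpu_pm-gpu.cmake
string(APPEND CMAKE_Fortran_FLAGS " -fopenacc")
string(APPEND CMAKE_EXE_LINKER_FLAGS " -fopenacc")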

@rljacob (Member) commented Jun 12, 2024

@xylar, yes, you're right. @grnydawn, run that test on Perlmutter on this branch.

@rljacob added the "Machine Files" and "pm-gpu (Perlmutter machine at NERSC, GPU nodes)" labels, Jun 12, 2024
@grnydawn (Contributor, Author) commented:

@rljacob, I ran the "SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu" test on Perlmutter using the branch 'https://github.com/E3SM-Project/E3SM/tree/ykim/machinefiles/fix-nvidiagpu-pm-gpu', and got the following error. It might not be related to this "-acc" issue; I will look into what caused it.

create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /pscratch/sd/y/youngsun/e3sm_scratch/pm-gpu/PullRequest6460/tests/SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu.20240612_131302_uwtt6j
RUNNING TESTS:
  SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu
Starting CREATE_NEWCASE for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished CREATE_NEWCASE for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 1.163727 seconds (PASS)
Starting XML for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished XML for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 0.381744 seconds (PASS)
Starting SETUP for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu with 1 procs
Finished SETUP for test SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu in 1.705043 seconds (FAIL). [COMPLETED 1 of 1]
    Case dir: /pscratch/sd/y/youngsun/e3sm_scratch/pm-gpu/PullRequest6460/tests/SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu.20240612_131302_uwtt6j
    Errors were:
        ERROR: No variable SMP_PRESENT found in case

@jgfouca (Member) commented Jun 12, 2024

Something old got into your code.

ERROR: No variable SMP_PRESENT found in case

SMP_PRESENT was replaced with BUILD_THREADED quite a while ago.

@rljacob (Member) commented Jun 12, 2024

@grnydawn did you update your submodules?

@grnydawn (Contributor, Author) commented:

@rljacob, @jgfouca: that was my mistake. After running git submodule update --init --recursive, the case built successfully and the test job is in the Perlmutter queue. Thanks!

@ndkeen (Contributor) commented Jun 13, 2024

Noting that we still have the same build failure as noted in #6470,
which is with SMS_Ld1.T62_oEC60to30v3.CMPASO-NYF.pm-gpu_nvidiagpu.

That means the only test I can use to verify the changes are OK is:
SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu
but that test was already passing before the change in this PR.

And with or without this change, I can't seem to build the other GPU tests that currently run with gnugpu:

SMS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-gpu_gnugpu
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu (which currently also fails to build, for a different issue)

I suspect the reason nobody can name a relevant test here in this issue is that you are developing with stand-alone MPAS or Omega. So there is not a way to verify that these changes are OK.

@ndkeen (Contributor) commented Jun 13, 2024

I'm trying the GPU tests we have with:

if (COMP_NAME STREQUAL elm)
  string(APPEND CMAKE_Fortran_FLAGS " -noacc")
  string(APPEND CMAKE_C_FLAGS " -noacc")
  string(APPEND CMAKE_CXX_FLAGS " -noacc")
else()
  string(APPEND CMAKE_Fortran_FLAGS " -acc -gpu=cc80 -Minfo=accel")
  string(APPEND CMAKE_EXE_LINKER_FLAGS " -acc -gpu=cc80 -Minfo=accel")
endif()

Regarding the SCREAM test, it seems to have more general problems with the nvidia compiler and hits a segfault even on CPU:
SMS_D.ne4pg2_ne4pg2.F2010-SCREAM-LR.pm-cpu_nvidia (which is a variant of a test in e3sm_gpucxx).
So we can ignore that failing with nvidiagpu.

With the MMF test SMS_D_Ln9.ne4pg2_ne4pg2.F2010-MMF1.pm-gpu_nvidiagpu, there are build issues with YAKL:

"/global/cfs/cdirs/e3sm/ndk/repos/pr/ykim_machinefiles_fix-nvidiagpu-pm-gpu/externals/YAKL/src/YAKL_streams_events.h", line 452: error: identifier "NV_IS_HOST" is undefined
        YAKL_EXECUTE_ON_HOST_ONLY( list = new std::vector<Stream>; )
        ^

I'm thinking it's maybe not worth trying to verify that we can build/run a test that uses LAND (with ACC) right now. We can just make the change here -- which again means that ACC is only supported with the nvidiagpu compiler, and only for tests that do not use LAND.

@ndkeen (Contributor) commented Jun 13, 2024

I made some minor edits and committed to the branch -- if it looks OK, we can merge this.

@ndkeen changed the title from "Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on Perlmutter" to "Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on pm-gpu", Jun 13, 2024
@ndkeen added the "nvidia compiler" (formerly PGI) label, Jun 13, 2024
@ndkeen added a commit that referenced this pull request, Jun 13, 2024:

…6460)

For pm-gpu: update the CMAKE_EXE_LINKER_FLAGS CMake variable to use the Nvidia cc80 compute capability,
to match the Nvidia A100 GPUs.
Should only affect ACC.
@ndkeen (Contributor) commented Jun 13, 2024

merged to next

@ndkeen merged commit 3f489e8 into master, Jun 18, 2024 -- 21 checks passed.
@ndkeen deleted the ykim/machinefiles/fix-nvidiagpu-pm-gpu branch, June 18, 2024 18:03.