Update nvidiagpu_pm-gpu.cmake to use Nvidia cc80 compute capability on pm-gpu #6460
Yes that is clearly wrong. From git blame, it looks like @jgfouca made the change -- Jim, can you say what your thinking was here? And maybe check to see if other changes were made at the same time?
@grnydawn, thanks for making this PR!
@ndkeen, I think it's fine. Git blame is showing the commit I made that changes all the macros to the newer style (using cmake names for things instead of custom CIME names).
Not that we need to cast blame but I think this was introduced by @amametjanov in 8eb8af2
I understand -- but looking at PR that made the change might be good to see other changes made. I'm testing some things now -- why do we have the |
This was the PR: |
I don't think we want to build with OpenACC by default. For example, the land model has OpenACC pragmas, but they are not often tested. If MPAS needs OpenACC, maybe there is a way to compile it conditionally.
I don't think MPAS-Ocean has support for GPUs other than OpenACC, so it should be compiled with OpenACC on GPU machines.
@philipwjones would be the better person to address this, though.
Yes, MPAS-Ocean uses OpenACC, and only OpenACC, for GPU, so the acc flag is required. The only thing this PR is doing is upgrading the target hardware flags to take advantage of any hardware-specific compiler optimization that might be available.
But we do not want all Fortran files being built with ACC. So I suspect you may need to find a way to only issue the acc flags when building those source files needing it (MPAS?). For example, this test will fail to build with ACC, but works when I turn it off. I think currently, GNU builds are not using ACC. What test are you trying?
@ndkeen, that may be an important issue to address. I think we're arguing that it's beyond the scope of this particular PR. In Omega development, we can't currently use nvidiagpu-pm-gpu without this fix but aren't particularly interested in fixing the OpenACC issue, which we did not introduce and do not need to have fixed for our purposes. Perhaps a different issue or PR is needed to cover the issues you're bringing up?
It might be more fundamental. What test are you trying? It looks like a test that might work with nvidiagpu would not work with gnugpu as the acc flags are not on by default? I need to add |
@ndkeen, I think we could work with that. We're making the change here on the Omega develop branch for now but we can revert that before we merge in whatever changes you need to make to E3SM master. We just wanted to make sure this particular issue got fixed on the E3SM side, rather than getting merged in from Omega.
@ndkeen, @xylar, Omega does not use the "-acc" flag. I think that the "-acc" flag has not caused any issues with the NVHPC compiler for the Omega build because the NVHPC compiler simply ignores it if the source code does not include any OpenACC directives. However, the "-gpu=cc70,cc60" flag in "CMAKE_EXE_LINKER_FLAGS" caused an issue when CMake tried to detect the type of compiler, and a simple test code could not be compiled.
What I'm saying is that the PR that added those lines should not have been done at all -- we don't want ACC on by default. Unless we do. And, on top of that, for nvidia (and maybe the Cray compiler), we have to turn off ACC explicitly. What test are you trying?
@ndkeen, I generally agree that component-specific flags, such as the "-acc" flag, are better limited to the specific component. I also agree with @xylar that this "-acc" issue might need another PR. If you are asking about MPAS-Ocean testing, I think I am not the right person to answer because I am not familiar with the tests for MPAS-Ocean. If you are asking about Omega testing, currently, there are several unit tests for each algorithm being developed. During Omega building, CIME configurations from an E3SM case, including compiler and CMake variables, are collected.
Well...this is an NVIDIA GPU configuration, and OpenACC was the means by which MPAS-Ocean, land, and even parts of the atmosphere at one point were accessing the GPU. So it made sense to have it on as default, and it doesn't impact non-ACC code. The fact that some ACC code isn't working is a bug that needs fixing. A large fraction of MPAS code is ACC-enabled, so if you do it conditionally, it should be done at the component level for MPAS-Ocean and not on a per-file basis.
Currently, GNU-built code will not use ACC, but nvidia will (ie
To the extent the compilers support it (gnu, cray and nvidia all have some support, with nvidia the most stable) and to the extent it's been tested and shown to have some benefit (not always clear), I think we should use it. Otherwise why did we put all that work into it? My opinion only...
We could consider something like this in
so that source in LAND will not be built with ACC. It will add those flags to all Fortran sources (except in LAND) and will hopefully have no impact on sources without pragmas. Will need to verify no ill effects from this -- but currently I think the only use of nvidia compilers is for testing. Unfortunately, with the above change, the following test builds but hits "Bus error". But then, should we not also enable ACC for gnu builds?
Ok I caught up with this. The "gpu" in a compiler name means "turn on all the GPU things". So yes add -acc to gnugpu. We have a "gpuacc" CIME-based test suite which currently runs only on nvidia/Perlmutter. It has 2 tests: The difference between those two is the first one has a full ocean model. So something is probably broken in the MPAS-ocean OpenACC calls. @grnydawn you have to make sure the test that is currently passing continues to pass with these changes. |
@ndkeen the land model failing with -acc has nothing to do with this PR. Open a separate issue for that.
@rljacob, Omega testing does not have an issue with or without the '-acc' flag, so I think the change might not affect Omega. In MPAS-ocean, I will run the 'gpuacc' test suite once the change is made and see if there are any issues.
After some discussion, we are mostly in agreement on a "rule" that is roughly: any compiler with To address the issue with LAND sources not all building with ACC, we could try what I suggested above for now (or something better). I'm also chatting with @peterdschwartz who may have ideas. (*) I just want to note my protest here for this original naming scheme -- the compiler name should be the compiler name. And then use modifiers to indicate what else is happening (such as threading, ACC, etc -- or even that build is meant for GPU or not) |
@rljacob, I ran the "SMS_Ld1.T62_oEC60to30v3.DTESTM.pm-gpu_nvidiagpu" test on Perlmutter using this branch, 'https://github.com/E3SM-Project/E3SM/tree/ykim/machinefiles/fix-nvidiagpu-pm-gpu', and got the following error. It might not be related to this "-acc" issue; I will look into what caused it.
Something old got into your code.
@grnydawn did you update your submodules?
Noting that we still have the same build failure as noted here: #6470. That means the only test I can use to verify changes are OK is with: And with or without this change, can't seem to build the other GPU tests that currently run with
I suspect the reason nobody can name a relevant test here in this issue is that you are developing with stand-alone MPAS or OMEGA. So there is not a way to verify that these changes are OK.
I'm trying the GPU tests we have with:
Regarding the SCREAM test, it seems to have more general problems with nvidia compiler and hits segfault even on CPU With MMF test
I'm thinking it's maybe not worth trying to verify that we can build/run a test that uses LAND (with ACC) right now. And we can just make the change here -- which again means that ACC is only supported with |
make same change for muller-gpu
I made some minor edits and committed to branch -- if it looks OK, can merge this
…6460) For pm-gpu: Update CMAKE_EXE_LINKER_FLAGS Cmake variable to use Nvidia cc80 compute capability to match with Nvidia A100 GPUs Should only affect ACC
merged to next
Update the CMAKE_EXE_LINKER_FLAGS CMake variable to use the Nvidia cc80 compute capability on Perlmutter,
to match the Nvidia A100 GPUs
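
For readers skimming the thread, the change described above might look roughly like the fragment below. This is only a sketch: the thread quotes the "-gpu=cc70,cc60" value and the CMAKE_EXE_LINKER_FLAGS variable, but the rest of nvidiagpu_pm-gpu.cmake is not shown here, so the exact line layout and any neighboring flags are assumptions.

```cmake
# Sketch only: the real contents of nvidiagpu_pm-gpu.cmake are not shown in
# this thread, so the form of this line is an assumption.

# Before (flag value as quoted in the discussion): targets compute
# capabilities 7.0 and 6.0, which do not match the A100.
# string(APPEND CMAKE_EXE_LINKER_FLAGS " -gpu=cc70,cc60")

# After: target compute capability 8.0 (cc80), matching Nvidia A100 GPUs.
string(APPEND CMAKE_EXE_LINKER_FLAGS " -gpu=cc80")
```

Whether "-acc" should stay on by default next to this flag is the separate question debated above and is not addressed by this sketch.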