-
Notifications
You must be signed in to change notification settings - Fork 371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up kokkos/tril situation #6473
Conversation
If Trilinos is being used, we should use the kokkos that comes with it. This removes the need to explicitly set Kokkos_ROOT in config_machines.xml to the Trilinos one.
|
set(ENV{Kokkos_ROOT} ${INSTALL_SHAREDPATH}) | ||
# We should use the kokkos from Trilinos if we are using Trilinos | ||
if (USE_TRILINOS) | ||
set (ENV{Kokkos_ROOT} ${Trilinos_ROOT}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential issue with this is that when we switch to the preinstalled kokkos, we'll need a new path. But maybe I can change this then by creating something like Trilinos_Kokkos_ROOT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at first Kokkos_ROOT is only used to decide whether CIME needs to build kokkos during sharedlib phase. Then, in the CMake logic of the modelbuild phase, we could look for Kokkos only after handling Albany/Trilinos (if needed), so that we may skip find_package(Kokkos)
altogether. Something like
if (USE_ALBANY)
find_package(Albany REQUIRED) # this also finds trilinos and kokkos, as they are deps
endif()
if (USE_TRILINOS)
find_package(Trilinos REQUIRED) # this also finds kokkos, as it is a dep
endif()
if (USE_KOKKOS)
find_package(Kokkos REQUIRED)
endif()
In this version, if USE_ALBANY=TRUE, the calls to find_pacakge
for trilinos and kokkos return immediately, since the package was alrady found.
Right now, instead, we first look for kokkos, then for trilinos, and finally for albany. We should do it the other way around, since cmake pulls in deps anyways... If we do that, then Kokkos_ROOT becomes unused, since Kokkos is already found when find_package(Kokkos REQUIRED)
is parsed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartgol , those latter two blocks should be elseif
s , right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't matter. The calls to find_package return immediately if the pkg was already found anyways...
What's the long term solution when we'll have different components using Kokkos? |
Yes what happens when EAMxx and MALI are both running? |
@rljacob we first need to answer another question: should e3sm force all components to use the same backend in kokkos? Or should they be free to use different ones? I think the latter is the right choice. Though we can (and should) force all components to use the same installation of kokkos. That said, we could potentially install a prebuilt kokkos with all "reasonable" backends enabled (that is, Serial/OpenMP on CPU machines, Serial/OpenMP/Cuda on pm-gpu, Serial/OpenMP/HIP on Frontier,...). Then, each component is free to use whatever backend they need from kokkos. If MALI was installed with both Serial/OpenMP enabled, then eamxx will work just fine with compile_threaded on or off, regardless of whether MALI uses OpenMP or Serial. |
I agree although we will all have to be really careful to explicitly use the device we want to use. The default device will change depending on Kokkos cmake settings. |
Yes, using the default device should not be allowed. I don't know how to tell different components to use different devices (like "lnd use OpenMP, atm use Cuda, ..."), or how to force components to use Serial if compile_threaded=false. I don't even know exactly what we want to achieve. I suppose that if component A can run on GPU, we will never ask it to run on CPU on a gpu machine, right? The only catch is then how to prevent use of OpenMP if compile_threaded=False. I think we could have E3SM define the CMake var E3SM_KOKKOS_DEVICE, which components can use to pick a pre-defined device. E.g., e3sm can set E3SM_KOKKOS_DEVICE=Serial if compile_threaded=False (on a CPU machine only, since on GPU we prob want to use the GPU device). If, otoh, we may want component A to run on CPU even if we are on a GPU machine (e.g., component B can use GPU better), then we need new XML logic to specify a device for each component. Something like ATM_KOKKOS_DEVICE, LND_KOKKOS_DEVICE, ..., all of which can be initialized to whatever is the default device on that machine... But this would conflict a bit with compile_threaded, there would be some redundant (and possibly conflicting) information. |
If we can preinstall kokkos with Serial/OpenMP and have compile_threaded use the appropriate device for both eamxx/MALI, that seems to be the solution. I'm not sure if that's possible with preinstalled trilinos/albany. At worst, we'd have to preinstall two different versions of each. Alternatively, we could force trilinos/albany to use a specific backend regardless of e3sm parameters. It probably does make sense to have all backends though to be able to control which component runs on CPU/GPU. |
When I try to build on pm-cpu with
If I replace all of those get_string with get_si_string, it gets further
but still fails:
|
actually, I'm not sure if two different versions would work either if we use kokkos default device and we don't have a "TRILINOS_KOKKOS_DEVICE" or "ALBANY_KOKKOS_DEVICE". |
I don't think EAMxx and MALI will appear in the same compset until low-res EAMxx is working so an intermediate solution is fine for now. |
@jewatkins yeah, if you want MALI to honor compile_threaded, then we need multiple MALI installations. I don't see a way out of that. That said, if compile_threaded=False, we could have E3SM set OMP_NUM_THREADS=1, so that even if MALI used OpenMP all the time, it would still result in a single-thread run (Serial is not the same as OpenMP with 1 thread in terms of perf, but it should be bfb the same). |
Not sure if this is what you meant, but with E3SM, builds without openmp are not BFB with builds using openmp and only 1 thread. Only threaded builds are BFB with each other. |
O_O Isn't this a bit concerning? |
Yes, it's always bugged me, but there were some strange openmp decisions made long ago (before I joined). Was kinda hoping we would be rid of them once moved to C++/kokkos. |
Well, in all fairness, it is possible that reduction kernels are handled slightly differently in Serial vs OpenMP. Though that's a remote possibility.. |
Currently BG cases, which have both EAM and MALI active, are broken. So it's not just eamxx? Or was it the removal of Kokkos_ROOT machine settings in PR #6489? |
@jgfouca PR #6514 got merged, which had some overlap with this one. I am not sure if we still need this PR. I think 6514 handled kokkos differently, but what is done there is I think covering what this PR was doing, so maybe this is no longer needed. But I'll let you take a look and decide. In any case, there are some conflicts now that 6514 went in. |
I don't think we still need it. |
If Trilinos is being used, we should use the kokkos that comes with it. This removes the need to explicitly set Kokkos_ROOT in config_machines.xml to the Trilinos one.
[BFB]