Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

E3SM CMake Build System V2! #5943

Open
37 of 49 tasks
jgfouca opened this issue Sep 21, 2023 · 19 comments
Open
37 of 49 tasks

E3SM CMake Build System V2! #5943

jgfouca opened this issue Sep 21, 2023 · 19 comments

Comments

@jgfouca
Copy link
Member

jgfouca commented Sep 21, 2023

Background

I've been meaning to do this for a long time and a few tickets have motivated me to finally launch this effort:

E3SM-Project/scorpio#543
#5786
ESMCI/cime#4308
ESMCI/cime#3050

Our current build system has a lot of technical debt due to the way it evolved from perl+Makefiles to python+CMake. During the transition from Makefile to CMake, I think it made sense to keep as much backwards compatibility as possible in order to minimize disruption. Now that the CMake system has been in place for years, I think we can and should start thinking of refactoring the build system to be more like an industry-standard CMake system.

Dependency Management

I think the worst part of the current system is how we handle our external dependencies. These dependencies fall into two categories, the "TPLs" (things we expect to be installed on the machine and most likely in a loaded env module), and the "sharedlibs" (things we build during the SHAREDLIB phase, these are TPLs that we build ourselves). Here's a list of changes I think we should implement regarding dependency management:

  1. All dependencies should be located via find_package, using $PACKAGE_ROOT to find it, and we should use target_link_libraries to express the dependency.
  2. The above may require the creation of $package-config.cmake files. We should strive to make these as CMake-3 style as possible, target-oriented. If the package is very far from being a modern cmake package, we can write our own Find$package.cmake module.
  3. We should express the "true" dependencies instead of just packing them all onto the e3sm.exe target. For example, if eam uses things from gptl, then we will want target_link_libraries(eam ${GPTL_LIBRARIES})
  4. We should make it easy for users to change TPL dependency locations. I think the easiest way is to have config_machines.xml set the $PACKAGE_ROOT env var IF it's not already set. Something like this:
    <env name="NETCDF_ROOT">$SHELL{RESULT=${NETCDF_ROOT:-$DEFAULT_VALUE} echo $RESULT}}</env>
    Where DEFAULT_VALUE is either hardcoded or references some env var from a module. If the module file itself sets $PACKAGE_ROOT then this may break this approach. I don't know if any of them actually do that.
  5. There should be no need to set up TPL stuff in the cmake macros as long as all the $PACKAGE_ROOTs are set correctly.
  6. For sharedlibs, we should also allow the all of the above. If the build system sees that $SHAREDLIB_ROOT already exists and is populated, the build system should not try to rebuild it.
  7. For sharedlibs, we currently build all of them for every case. We should revisit this and think if this is really necessary.
  8. Currently, a lot of the dependency stuff is handled in common_setup.cmake, which means it's run for every component. This is means a lot of the processing is repeated many times. We should factor this out and put it somewhere else so it runs only once.
  9. If done correctly, the use of SLIBS should almost entirely go away.

Other considerations

  • Our system should always prefer to use Cmake-3, target-oriented, style
  • Purge stale/unused machines/compilers
  • The cmake macros should use cmake names for things. For example, CMAKE_C_FLAGS instead of CFLAGS. This is tricky because some of the non-cmake sharedlibs expect the old Makefile-style names.
  • Remove the OS concept from the cmake macro system. @sarats has already done most of the work for this
  • When creating a case, CIME CCS should only move over the necessary CMake macros instead of all of them
  • The legacy Makefile stuff is most obvious in common_setup.cmake and, to a lesser extent, build_model.cmake. This legacy stuff should be removed or refactored.
  • Eamxx/SCREAM, when built as part of a full case (not standalone), needs to respect the flags set up by CIME.
  • All significant changes to the use of our build system needs to be documented here: https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/1011942171/E3SM+Build+System+and+CMake
  • The cache vars that our cmake expects to receive from CIME/build.py should be explicitly listed, probably in a cache.cmake file that the main CMakeLists.txt file can include. This will allow us be explicit about the type and add documentation.
  • Missing macro warning should only be printed once.
  • It seems test cases no longer support cmake_macro case customizations due to the test infrastructure always doing case resets.
  • Need to think of ways to reduce time building shared libs. What we are doing now is very wasteful
  • The Depends file system is still an odd and non standard way of customizing builds
  • Homme tests should use bld/cmake-bld instead of just bld.
  • When finding packages, we should probably set NO_DEFAULTS_PATH
  • Cmake macro names are backwards compared to test case names. The macros are $compiler_$machine but test cases are $case.$machine_$compiler. The macro naming format should be changed to match.
  • Make EKAT a shared lib?
  • CMake+Ninja seems to be working a lot better these days. Revist making Ninja the default Cmake backend.

Approach

An incremental approach is always preferable if possible. I think we should be able to move the dependencies, one-by-one, over to the new style.

Dependency list:

TPLs:

  • netcdf (netcdfC, netcdf_fortran, pnetcdf). I believe this dep should come in via spio. This is marked as done since a workaround for a good scorpio-config is in place.
  • hdf5. This should come in through netcdf which should come in from spio
  • blas/lapack
  • MPI

Sharedlibs:

  • csm_share
  • gptl. I believe this dep should come in via spio. This is marked as done since a workaround for a good scorpio-config is in place.
  • kokkos
  • mct
  • spio (requires a spio-config.cmake in the install area, @jayeshkrishna will work on this). This is marked as done since a workaround for a good scorpio-config is in place.
  • cprnc (optional). We don't build against this, so I think the current state is fine for now.
  • MPI-serial (optional). I believe this dep should come in via spio. It has a FindMPISERIAL module. This is marked as done since a workaround for a good scorpio-config is in place.

Optional TPLs

  • MKL. This should come in with blas/lapack via BLA_VENDOR being set to intel.
  • COSP. Built as part of eam if USE_COSP is on. Cosp is eam code so this is fine.
  • YAKL. Currently built as part of eam if USE_YAKL is on. This is probably good enough for now until multiple components decide they want to use YAKL. This could become a sharedlib.
  • ESMF. I don't think E3SM ever uses this.
  • Trilinos
  • Albany
  • SAMXX. Currently built as part of eam if USE_SAMXX is on. This is fine since SAMXX is eam code.
  • RRTMGPXX. Currently built as part of eam if USE_RRTMGPXX is on. This is fine since rrtmgp is eam code.
  • FMS
  • PETSC
  • ADIOS. I believe this dep should come in via spio. This is marked as done since a workaround for a good scorpio-config is in place.
  • MOAB. Helped @rljacob do this the right way on his MOAB branch.
  • HIP. Used through Kokkos/YAKL. I think things are fine.
  • SYCL. Used through Kokkos/YAKL. I think things are fine.

Updating Sharedlibs to not require CIME/Tools/Makefile so we can change our cmake macros to use cmake names

  • PIO/SCORPIO
  • CPRNC
  • MCT
  • mpi-serial
  • GPTL
  • csm_share

Testing

We have to be very careful when refactoring the build system to not change the core semantics of how things get built. I used to have a script, way back in the initial cmake effort, that parsed the build log and created a map of targets and flags. This map could be compared against maps from other logs to do a global compare to ensure flags are identical or at least semantically identical. I'd also like to set up some tests for various use case, for example, ensuring that a user can use a custom PACKAGE_ROOT. I'm not sure the best way to do this yet... perhaps I can do this as an e3sm-specific part of the CIME regression suite or I can make a new test suite.

Related PRs

#5948
#5950
#5959
ESMCI/cime#4492
#5964
#5972
#5975
#5976
#5990
#5993
#5995
#5997
#5998
#6005
#6009
#6025
#6028
#6014
#6069

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

@jonbob , GH wouldn't let me assign you. I think it tops out at 10 assignees. I think you might be interested in this as well. I think the MPAS cmake is in decent shape and probably will only require minor mods.

@bartgol
Copy link
Contributor

bartgol commented Sep 21, 2023

I strongly support this effort. Having e3sm follow state-of-the-art (and relatively well known) patterns will make life MUCH easier for future code maintainers onboarding, as well as for debugging or adding features.

I have a branch that tries to make scorpio a cmake installable package. I was trying to do it to use scorpio inside another lib I am working on, which needs to be installed as a cmake pkg (and therefore must have all its deps as cmake pkgs). I can revive it, if that helps with getting sharedlibs to the standard we need.

I am not sure how to do this incrementally, while keeping (some) bwd compatibility. I fear that we will have to break bwd compatibility almost right off the bat. But maybe I'm wrong.

I'd also like to set up some tests for various use case, for example, ensuring that a user can use a custom PACKAGE_ROOT. I'm not sure the best way to do this yet... perhaps I can do this as an e3sm-specific part of the CIME regression suite or I can make a new test suite.

Can we just add a test mod with shell_commands that sets an env var? Or does the CCS run shell_commands in a subshell, hence losing any env mod it may do?

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

@bartgol , I was more speaking of where to put the test itself rather than how to test it. I think the "how" is pretty easy, just set PACKAGE_ROOT and then scan the log to make sure it used it.

@jonbob
Copy link
Contributor

jonbob commented Sep 21, 2023

Thanks @jgfouca -- I'll keep an eye on it, but I know it's in good hands

@sarats
Copy link
Member

sarats commented Sep 21, 2023

@grnydawn FYI, in case this impacts how you are doing CMake setup for Omega in E3SM-Project#31

@rljacob rljacob assigned vijaysm and grnydawn and unassigned mt5555 and rljacob Sep 21, 2023
@rljacob
Copy link
Member

rljacob commented Sep 21, 2023

What would backward compatibility mean for this?

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

@rljacob , we don't want the semantics of the build to change, we just want a cleaner cmake driving it.

@bartgol
Copy link
Contributor

bartgol commented Sep 21, 2023

Rob, I guess stuff like the content of mach_compiler.cmake. We will likely change the name of some cmake vars, to be more cmake-y, which means that existing mach files on some forks (not currently in master) may no longer work.

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

@rljacob could you take a glance at the Optional TPL list and let me know if you see anything that we don't use anymore?

@philipwjones
Copy link
Contributor

@jgfouca As we develop Omega (YAKL-based) and as other codes may potentially migrate the C++ way... some TPLs like yaml-cpp and spdlog will be used across components too. These are currently buried under ekat/extern. It would be great to promote these (YAKL, sdplog, yaml-cpp) to the shared libs level and treated the same. Looking forward to a better Cmake build.

@rljacob
Copy link
Member

rljacob commented Sep 21, 2023

What are "MKL" and "FMS" TPLs? I can't remember why SAMXX, RTMGPXX and COSP are treated like TPLs and if we want to keep doing that.

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

@rljacob , MKL is the intel Math Kernel Library. I have no idea what FMS is; I was hoping you knew.

@jgfouca
Copy link
Member Author

jgfouca commented Sep 21, 2023

There are very few references to FMS in E3SM or CIME. I think it's almost certain we aren't using it. The best clue I got is this python code:

CIME/build.py:

    if ocn_model == "mom" or (atm_dycore and atm_dycore == "fv3"):
	cmake_args += " -DUSE_FMS=TRUE "

driver-mct/cime_config/buildexe:

    if ocn_model == 'mom' or atm_model == "ufsatm":
        gmake_opts += "USE_FMS=TRUE"

@rljacob
Copy link
Member

rljacob commented Sep 21, 2023

Ok then it refers to Flexible Modeling System, a GFDL coupling layer. mom and fv3 are GFDL models. Probably can't remove that from cime/CIME but the E3SM-owned code doesn't need it.

@bartgol
Copy link
Contributor

bartgol commented Sep 21, 2023

@jgfouca As we develop Omega (YAKL-based) and as other codes may potentially migrate the C++ way... some TPLs like yaml-cpp and spdlog will be used across components too. These are currently buried under ekat/extern. It would be great to promote these (YAKL, sdplog, yaml-cpp) to the shared libs level and treated the same. Looking forward to a better Cmake build.

@philipwjones on my to-do list I have the task of making ekat use pre-built TPLs (if available) via find_package, and only proceed to build them if not found. I also want to switch to a ExternalProject_Add approach, rather than the submodule one, since it will considerably speed up our repo cloning (and to some extent even its size).

@rljacob
Copy link
Member

rljacob commented Sep 21, 2023

IIRC, @jonbob advocated for backwards compatibility when cmake was first introduced but I don't recall why. Something about MPAS standalone build?

@rljacob
Copy link
Member

rljacob commented Sep 21, 2023

Oh yeah the Intel MKL. I think thats optional because its not useful/available on all platforms. Plenty of machines in config_machines are loading the mkl module.

@jonbob
Copy link
Contributor

jonbob commented Sep 21, 2023

@rljacob -- yes, the old build system gave us some different information and capabilities when the MPAS components were all separate submodules. But that's no longer an issue

@bartgol
Copy link
Contributor

bartgol commented Dec 22, 2023

@jgfouca We should add a nano cmake task: change the find_xyz calls to either a) use HINTS instead of PATHS or b) add NO_DEFAULT_PATH that disable search in system folders.

On my laptop (ok, I get it, it's my laptop, but the problem may appear somewhere else too), I have a netcdf installed in the system, which gets picked up before the one I manually installed. The system one does not have the .mod file installed along with the .inc file, so the find_path finds netcdf.inc in /usr/include, but then at build time I get a compiler error b/c netcdf.mod is not found.

The issue is that in find_path (netcdf_f_incdir netcdf.inc REQUIRED PATHS ${NETCDF_FORTRAN_PATH}/include), the user provided path is searched after the default paths. OTOH, if you do find_path (netcdf_f_incdir netcdf.inc REQUIRED HINTS ${NETCDF_FORTRAN_PATH}/include), the user provided path will be searched before the default paths. CMake docs state that

Search the paths specified by the HINTS option. These should be paths computed by system introspection, such as a hint provided by the location of another item already found. Hard-coded guesses should be specified with the PATHS option.

but the fact that one is searched before default paths, and the other is searched after is, in my opinion, a massive difference, to the point that it's the only thing that matters...

As I mentioned above, another solution is to add NO_DEFAULT_PATH to the call. Considering that we pass a path (which we grab from machine specs files), perhaps this is the best solution. We don't want any default path to interfere with settings in cime_config/machines/cmake_macros/xyz. Quite the opposite: we only trust what comes from the macros files. I think this is also what CMake suggests at the very bottom of the find_path doc page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment