
Reads compiler/machine configs from E3SM CIME #31

Merged: 10 commits merged into E3SM-Project:develop from ykim/omega/cmakebuild-cime on Oct 10, 2023

Conversation

@grnydawn commented Aug 25, 2023

Reads compiler/machine configs from E3SM CIME (config_machines.xml and cmake-macros/*.cmake files).

  • uses MPICC from E3SM CIME as the default Omega compiler
  • loads machine-specific modules
  • sets environment variables according to the machine and compiler configs
  • allows the user to override the compiler and compiler/linker options using OMEGA_CXX_COMPILER, OMEGA_CXX_FLAGS, and OMEGA_LINK_FLAGS (a sketch of this override logic follows the list)
  • adds gen_machine_info.py to read config_machines.xml in E3SM CIME
  • drives the unit testing for data types using CMake
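
A minimal, purely illustrative CMake sketch of the override precedence described above (not taken from this PR); MPICXX_FROM_CIME stands in for whatever compiler wrapper the build harvests from CIME, and the flag handling is simplified:

    # Hypothetical sketch: only the OMEGA_* option names come from this PR.
    if(DEFINED OMEGA_CXX_COMPILER)
      # An explicit user override takes precedence
      set(CMAKE_CXX_COMPILER "${OMEGA_CXX_COMPILER}")
    else()
      # Otherwise fall back to the MPI compiler wrapper harvested from E3SM CIME
      set(CMAKE_CXX_COMPILER "${MPICXX_FROM_CIME}")
    endif()
    if(DEFINED OMEGA_CXX_FLAGS)
      set(CMAKE_CXX_FLAGS "${OMEGA_CXX_FLAGS}")
    endif()
    if(DEFINED OMEGA_LINK_FLAGS)
      set(CMAKE_EXE_LINKER_FLAGS "${OMEGA_LINK_FLAGS}")
    endif()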

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Polaris tests for new features have been added per the approved design (and included in a test suite)
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.
    • Polaris test suite has passed
    • Performance related PRs: Please include a relevant PACE experiment link documenting performance before and after.
  • Stealth Features
    • If any stealth features are included in the PR, please confirm that they have been documented.

@philipwjones

@grnydawn this looks good and works great on Chrysalis with intel. However, I did not have much luck on OLCF machines.

On Frontier:
using crayclanggpu as the target: CMake runs fine, but when it gets to MachEnv.h during the make, it can't find mpi.h (and mpi.h does exist when I manually look in the relevant path)

So I tried just crayclang without the GPU, and CMake failed while trying to test this compiler with a test build:

Error invoking pkg-config!
Package hdf5_hl_cpp was not found in the pkg-config search path.
Perhaps you should add the directory containing `hdf5_hl_cpp.pc'
to the PKG_CONFIG_PATH environment variable
No package 'hdf5_hl_cpp' found
Package hdf5_cpp was not found in the pkg-config search path.
Perhaps you should add the directory containing `hdf5_cpp.pc'
to the PKG_CONFIG_PATH environment variable
No package 'hdf5_cpp' found

which is strange after it worked fine for crayclanggpu...

I tried Summit too, and it failed during the build with an unknown flag error (-ML) for a pgi build, so it may be that CIME's setup for pgi/nvidia has a bad flag?

@grnydawn (Author)

@philipwjones , Thank you for the information. I'm currently investigating the issues you pointed out, and I wanted to share some of my initial thoughts:

I encountered the same problem with crayclanggpu. It seems that the issue with crayclanggpu is connected to how the Scream team utilizes compiler configurations. The MPICC for crayclanggpu ends up pointing to hipcc instead of the Cray compiler wrapper, causing it to not recognize the paths for the MPI library. To address this, I need to incorporate MPI library and include paths. While this is feasible, I haven't yet determined the optimal way to provide this information in a manner that works across all systems.

I tested crayclang on Frontier and encountered no issues. It's puzzling that your log includes hdf5-related problems, as no unit tests or other Omega code currently use the hdf5 library.

I attempted the IBM compiler on Summit. Initially, I faced an issue due to a Python version conflict. After loading python/3.8.10, the problem was resolved, and I could successfully execute ctest by using "jsrun -n 1 ctest" on a batch node. I'll continue to explore solutions for the python version issue.

@sarats (Member) commented Sep 8, 2023

On Frontier, hipcc was used as the C++ compiler for both MMF and Scream, as Cray's C++ compiler (invoked via the CC wrapper) lagged a bit in terms of ROCm features. The MPI libraries and include paths were specified in the compiler_machine cmake_macros file.

Do I understand correctly that you are only parsing config_machines.xml and not looking at the cmake_macros files? Certain other things, like compiler options/workarounds, PIO hints, and miscellaneous settings, are specified in those files. It might be too big a task to parse everything from there, but perhaps we should extract a subset? Or should those be specified as needed in Omega separately?

Or a more radical idea that probably won't work: should we include that cmake_macros file in its entirety? There are certainly some options that won't be applicable, and we would need to override/ignore some settings.
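
As an illustration only, the "include the macro file wholesale" idea above might look roughly like the following in CMake; the E3SM_ROOT and OMEGA_CIME_COMPILER variables, the macros path, and the file-naming convention are assumptions for this sketch, not verified details:

    # Hypothetical sketch: include an E3SM cmake_macros file wholesale,
    # then override settings that do not apply to Omega.
    set(MACROS_DIR "${E3SM_ROOT}/cime_config/machines/cmake_macros")
    include("${MACROS_DIR}/${OMEGA_CIME_COMPILER}_${OMEGA_CIME_MACHINE}.cmake" OPTIONAL)
    # Example override: drop a Fortran-only setting that Omega's C++ build does not use
    unset(FFLAGS)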

@sarats (Member) commented Sep 8, 2023

Additionally, do we need a Depends.machine_compiler capability to handle exceptions like reduced optimization levels etc?

@grnydawn (Author) commented Sep 8, 2023

@sarats, thanks for the comments. The current Omega CMake build does read the cmake files in the cmake_macros folder according to the machine and compiler names collected from config_machines.xml, so the Omega build system has all the CMake variables set by the E3SM cmake files. The difficulty comes from the fact that each cmake file uses the E3SM CMake variables differently, especially for GPU builds. For example, one cmake file uses the SLIB variable to hold MPI flags while another uses LDFLAGS for that purpose, and others just use the MPIFC/MPICXX compiler wrappers. For now, I am thinking of using the MPI_PATH variable to set the MPI include path, library path, and flags, but I am not sure whether this approach will work across all systems.
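
A minimal sketch (an assumption, not the PR's actual code) of how the MPI_PATH idea mentioned above could be wired in, with illustrative names:

    # Hypothetical sketch: add MPI include/library paths from an MPI_PATH setting
    # when the chosen compiler (e.g. hipcc) is not an MPI wrapper.
    if(DEFINED ENV{MPI_PATH})
      include_directories("$ENV{MPI_PATH}/include")
      link_directories("$ENV{MPI_PATH}/lib")
      link_libraries(mpi)  # library name varies by MPI implementation
    endif()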

So far, we are not using Depends.machine_compiler. As in the cmake_macros case, we can select the proper Depends.* files according to the machine and compiler names collected. But it is not clear to me whether we want to use the Depends files or configure this in Omega's own way.

@grnydawn (Author) commented Sep 8, 2023

@philipwjones , I have updated this PR to resolve the issues that you reported. Could you rerun your tests with this update to see how it goes?

Major updates are:

  • MPI directories are collected using the CMake FindMPI package (after all module commands are executed); see the sketch after this list
  • NetCDF and PnetCDF are supported
  • Compiler flags for GPUs are appended to the C++ flags (CUDA_FLAGS, HIP_FLAGS, SYCL_FLAGS)
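
A rough CMake sketch of the FindMPI approach in the first bullet (illustrative only; the omega target name and the OMEGA_ARCH handling are assumptions based on variables mentioned elsewhere in this thread):

    # Hypothetical sketch: locate MPI after the machine's module commands have
    # already been executed, so the wrappers and paths are in the environment.
    find_package(MPI REQUIRED COMPONENTS CXX)
    target_link_libraries(omega PUBLIC MPI::MPI_CXX)  # 'omega' is an assumed target name
    # Append GPU-specific flags to the C++ flags.
    if(OMEGA_ARCH STREQUAL "CUDA")
      string(APPEND CMAKE_CXX_FLAGS " ${CUDA_FLAGS}")
    elseif(OMEGA_ARCH STREQUAL "HIP")
      string(APPEND CMAKE_CXX_FLAGS " ${HIP_FLAGS}")
    elseif(OMEGA_ARCH STREQUAL "SYCL")
      string(APPEND CMAKE_CXX_FLAGS " ${SYCL_FLAGS}")
    endif()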

@grnydawn requested a review from sarats on September 8, 2023 at 19:24
@philipwjones

OK, it works for crayclanggpu on Frontier, but not for the CPU crayclang target. For the crayclang target, it complains that there is "No such file or directory" called -lstdc++, so I think the -lstdc++ argument must be misplaced and the compiler treats it as a file to be processed rather than a library. I couldn't find where that string is put together or why it would be different for crayclang when it worked fine for crayclanggpu. I haven't tried other builds yet.

@philipwjones

On Summit, a CPU build with the pgi target complains about an unknown switch (-MF) for nvhpc++. The pgigpu target has the same error and is also missing some CMake variables when I try to build the CUDA back end (OMEGA_ARCH=CUDA): CMAKE_CUDA_COMPILE_OBJECT and _CMAKE_CUDA_WHOLE_FLAG.

@grnydawn (Author)

@philipwjones, thanks for the checks. I see the same issues that you pointed out. I think I need to dig more into CMake. Also, I may need to set up CMake "unit tests" to verify my changes to the Omega CMake code across multiple systems as well as multiple compilers.

@philipwjones

@grnydawn - no problem, it seems to be really close to working everywhere. And I know it's hard to test all the supported compiler and machine options - I was hoping leveraging the E3SM options would make it easier, but even harvesting those seems to be non-trivial.

@xylar commented Sep 19, 2023

@grnydawn, I see that this adds python code to the Omega repo. I'd like to add some linting for python code equivalent to what we use in Polaris. Rather than complicating this PR, I'll add that separately and clean up the python script later.

@xylar commented Sep 19, 2023

@grnydawn, would you mind seeing if you can rebase onto develop to bring in my new linting additions from #34? See if you can follow my instructions from Slack from before, but if not we can work through it together again.

This merge updates the capability of using E3SM configurations, including the compiler and compiler options.

* uses MPICXX from E3SM CIME as the default Omega compiler
* allows the user to specify compiler and compiler/linker options using OMEGA_CXX_COMPILER, OMEGA_CXX_FLAGS, and OMEGA_LINK_OPTIONS
* adds gen_machine_info.py to read config_machines.xml in E3SM CIME
* adds the unit testing for data types
* adds new CMake variables: OMEGA_INCLUDE_DIRECTORIES and OMEGA_LINK_DIRECTORIES
* better supports compilers that are not wrapped in a compiler wrapper, such as hipcc
* uses CMake variables from E3SM for GPU models, including CUDA_FLAGS, HIP, and SYCL
* supports older Python versions
@grnydawn (Author)

@xylar, I rebased this branch onto the develop branch as you suggested. I saw some lint checking for Python code in your commits and will make modifications to my Python code if the CI check detects any issues.

@xylar commented Sep 20, 2023

Yep, it looks like there are some things to fix:
https://github.com/E3SM-Project/Omega/actions/runs/6250456001/job/16969368282?pr=31
Let me know if you could use my help.

@philipwjones

@grnydawn If it's easy to do at this stage, can you also grab the definition of mpiexec from the CIME machine files? Some of the unit tests (like MachEnv and the upcoming Decomp) require an 8-task MPI run, and it would be nice to have the test command be something like ${MPI_EXEC} -n 8 my_unit_test. If you'd rather wait for a later PR to add this, that's fine; only add it now if it's convenient. I'll just use srun in the interim since that works across several machines.
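
A minimal sketch of the test wiring being requested here; OMEGA_MPI_EXEC and the test/executable names are placeholders for illustration, not variables the PR defines:

    # Hypothetical sketch: launch an 8-task unit test through the machine's MPI launcher.
    set(OMEGA_MPI_EXEC "srun" CACHE STRING "MPI launch command (ideally harvested from CIME)")
    add_test(NAME MACH_ENV_TEST
             COMMAND ${OMEGA_MPI_EXEC} -n 8 ./testMachEnv.exe)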

@grnydawn (Author)

@philipwjones, I think it is not difficult to grab MPI_EXEC from the CIME config file. I need to update my Python code anyway to fix the linting issues.

@grnydawn (Author)

@xylar, I fixed all linting issues except the one below:

components/omega/gen_machine_info.py:33:1: E402 module level import not at top of file

I think I understand why I got this error. I created a class that inherits from one of the CIME classes, whose source file location is passed in through a command-line argument. I thought this approach was more flexible than a hard-coded CIME path. So, can we skip the E402 check for this Python code?

@xylar commented Sep 21, 2023

@grnydawn, yes, while that approach to imports is not preferable, it is unfortunately necessary for CIME, which doesn't play by normal python rules (ESMCI/cime#3886).

To skip the CI check, please add:

    from CIME.XML.machines import Machines  # noqa: E402

Note that there are 2 spaces before the #.

@sarats (Member) left a comment

Looks good to go for this stage. We can revisit once CIME CMake changes are finished upstream.

* creates scripts that pre-run the module commands in the build directory:
  omega_build.sh, omega_run.sh, and omega_ctest.sh
* the YAKL library can be automatically built using HIP or CUDA
* updates documentation for the CMake-based build.
@grnydawn (Author) commented Oct 9, 2023

@xylar, I tried to fix some formatting issues found during the CI tests, but could not fix them all. Could you look at the latest commits and guide me on how to fix them?

@grnydawn (Author) commented Oct 9, 2023

@philipwjones, I have updated this PR with better support for GPU and CPU builds on Frontier and Perlmutter. Summit is not supported yet; I think I know what the issues are, but it will take a little more time to support Summit.

To build and run the standalone Omega build, I created new scripts in the build directory: "omega_build.sh" for compiling, "omega_ctest.sh" for invoking ctest, and "omega_run.sh" for running the dummy executable. All of the scripts source "omega_env.sh" to match the modules and environment variables set in "config_machines.xml" of CIME.
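
One way such wrapper scripts could be generated at configure time is with configure_file(); this is only a sketch under the assumption of *.sh.in templates, not necessarily how this PR actually produces omega_build.sh, omega_ctest.sh, and omega_run.sh:

    # Hypothetical sketch: render shell-script templates into the build directory.
    configure_file("${CMAKE_CURRENT_SOURCE_DIR}/omega_build.sh.in"
                   "${CMAKE_BINARY_DIR}/omega_build.sh" @ONLY)
    configure_file("${CMAKE_CURRENT_SOURCE_DIR}/omega_ctest.sh.in"
                   "${CMAKE_BINARY_DIR}/omega_ctest.sh" @ONLY)
    configure_file("${CMAKE_CURRENT_SOURCE_DIR}/omega_run.sh.in"
                   "${CMAKE_BINARY_DIR}/omega_run.sh" @ONLY)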

Please review this update and let me know if you find any issue.

@grnydawn (Author) commented Oct 9, 2023

@xylar Thanks for the explanations. I fixed the issues and the CI tests pass now. Sorry that I forgot to run the pre-commit command from the mamba environment before pushing the commits.

@philipwjones left a comment

Tested this across Frontier (crayclang/crayclanggpu), Perlmutter (pm-cpu/nvidia), and Chrysalis (Intel), and it both builds and runs CTest successfully. Thanks @grnydawn

components/omega/doc/userGuide/OmegaBuild.md (review thread resolved)
@grnydawn (Author) commented Oct 10, 2023

@philipwjones Thanks for testing. I updated the user guide per your suggestion of using OMEGA_CIME_MACHINE. Please merge this when you finish your review.

@philipwjones merged commit fe7a0e6 into E3SM-Project:develop on Oct 10, 2023
2 checks passed
@grnydawn deleted the ykim/omega/cmakebuild-cime branch on October 18, 2023 at 15:21
Labels: CMake (CMake-related issues), Omega