By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs to revert to openblas/fftw as needed; skip wgrib2 with Intel oneapi; bump odc to 1.5.2 #1226

Merged: @climbfuji merged 33 commits into JCSDA:develop from feature/oneapi_intel_use_mkl on Aug 15, 2024

@climbfuji climbfuji commented Aug 7, 2024

Summary

Describe the changes made in this PR and why they are needed.

  1. Unrelated change but needed for gcc@13 support: bump odc from 1.4.6 to 1.5.2.

  2. Split configs/common/packages.yaml into a compiler-independent configs/common/packages.yaml and compiler-dependent configs/common/packages_${COMPILER}.yaml; use openblas and fftw as virtual providers for blas, lapack, fftw-api with gnu@ and apple-clang@; use intel-oneapi-mkl with intel@ and oneapi@.

  3. Site config updates for all sites: split packages.yaml into packages_${COMPILER}.yaml and add Intel MKL as an external package for the intel@ and oneapi@ compilers. Please follow the examples for blackpearl, narwhal, nautilus. Note that updating a site config does not imply that the update was tested (see section "Testing" below for which tests were done).

Update 2024/08/09: Certain site configs were modified to retain the openblas/fftw configuration with Intel by default - see list below. Steps to switch to the new default MKL configuration are documented in each site's packages_*.yaml.

List of sites:

  • blackpearl (@climbfuji) - uses MKL w/ Intel
  • acorn (@climbfuji) - keeps openblas w/ Intel
  • atlantis (@climbfuji) - uses MKL w/ Intel
  • aws-pcluster (@climbfuji) - keeps openblas w/ Intel
  • casper (@climbfuji) - no changes needed, only has GNU
  • derecho (@climbfuji) - keeps openblas w/ Intel
  • discover-scu16 (@climbfuji) - uses MKL w/ Intel
  • discover-scu17 (@climbfuji) - uses MKL w/ Intel
  • gaea-c5 (@climbfuji) - keeps openblas w/ Intel
  • gaea-c6 (@climbfuji) - keeps openblas w/ Intel
  • hera (@climbfuji) - keeps openblas w/ Intel
  • hercules (@climbfuji) - keeps openblas w/ Intel
  • jet (@climbfuji) - keeps openblas w/ Intel
  • narwhal (@climbfuji) - uses MKL w/ Intel; includes an update of the Intel compiler in the site config
  • nautilus (@climbfuji) - uses MKL w/ Intel
  • NOAA ParallelWorks (@climbfuji) - keeps openblas w/ Intel
  • orion (@climbfuji) - keeps openblas w/ Intel
  • s4 (@climbfuji) - keeps openblas w/ Intel
  • sandy (@climbfuji) - no changes needed, only has GNU
  4. Corresponding documentation updates.
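
For illustration, items 2 and 3 could look roughly like the following fragment of a compiler-dependent packages file for oneapi@. This is a minimal sketch using standard Spack `packages.yaml` syntax, not a copy of the files in this PR; the MKL version and the install prefix are hypothetical placeholders that each site would replace with its own values.

```yaml
# Sketch of a compiler-dependent packages file for Intel compilers.
# Assumption: the layout below is illustrative, not the exact file content from this PR.
packages:
  all:
    providers:
      # Prefer MKL as the provider of the virtual packages
      blas: [intel-oneapi-mkl]
      lapack: [intel-oneapi-mkl]
      fftw-api: [intel-oneapi-mkl]
  intel-oneapi-mkl:
    # Use the site's existing MKL installation instead of building it
    buildable: false
    externals:
    - spec: intel-oneapi-mkl@2023.1.0   # hypothetical version
      prefix: /opt/intel/oneapi         # hypothetical install prefix
```

A site that keeps openblas/fftw with Intel would instead list `openblas` for `blas` and `lapack` and `fftw` for `fftw-api` under `providers`.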

Testing

@fmahebert @srherbener @RatkoVasic-NOAA @natalie-perlin @AlexanderRichert-NOAA I "assigned" this PR to you in case you want to test the updated site configs (see list above) on the system(s) that you are responsible for - this is optional, because we'll go through all platforms in a few weeks anyway when we roll out spack-stack-1.8.0.

  • Built unified environment on blackpearl with gcc@13.3.0 and oneapi@2024.1.2 (@climbfuji)
  • Built neptune standalone environment on Narwhal with gcc@10.3.0 and intel@2021.10.0 (@climbfuji)
  • Built unified environment on Orion with intel@2021.9.0, with openblas (default per site config) and with MKL (manually changed) (@climbfuji)
  • CI

Applications affected

All

Systems affected

All

Dependencies

n/a

Issue(s) addressed

Resolves #759

Checklist

  • This PR addresses one issue/problem/enhancement, or has a very good reason for not doing so.
  • These changes have been tested on the affected systems and applications.
  • All dependency PRs/issues have been resolved and this PR can be merged.

@climbfuji climbfuji changed the title Use MKL as virtual provider for blas, lapack, fftw with Intel compilers (classic and llvm-based/oneapi); update site configs accordingly; skip wgrib2 with Intel oneapi Use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs accordingly; skip wgrib2 with Intel oneapi Aug 7, 2024
@climbfuji
Collaborator Author

@AlexanderRichert-NOAA I have two environments on Orion:

/work2/noaa/jcsda/dheinzel/spack-stack-feature-oneapi_intel_use_mkl/envs/ue-intel-2021.9.0/install/modulefiles/Core

and

/work2/noaa/jcsda/dheinzel/spack-stack-feature-oneapi_intel_use_mkl/envs/ue-intel-2021.9.0-mkl/install/modulefiles/Core

@climbfuji
Collaborator Author

@AlexanderRichert-NOAA I tried using the current ufs-weather-model develop branch with this spack-stack PR and the two test installs on Orion noted above. I tried the cpld_control_p8_mixedmode_intel regression test. Both the openblas and the mkl builds segfault after/in the ww3 initialization:

> cat out
...
180:        Wave model ...
180:  WW3 log written to /work/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_2304865/cpl
180:  d_control_p8_mixedmode_intel_test_mkl/./log.ww3

> cat err
150: WARNING from PE     0: Unused line in INPUT/MOM_input : ODA_VINC_VAR = 'v_inc'
150:
150:
150: WARNING from PE     0: Unused line in INPUT/MOM_input : ODA_INCUPD_NHOURS = 6
150:
 77: Abort(805361423) on node 77 (rank 77 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 77: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once

I checked and esmf is linked static, mapl shared. I don't know why/how that changed. Remember that spack-stack develop uses newer ESMF and MAPL versions, and I also had to put this workaround in the ufs-weather-model top-level CMakeLists.txt to be able to run cmake and make:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index e5fdd1e8..5c7a974a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -148,6 +148,9 @@ endif()

 find_package(NetCDF 4.7.4 REQUIRED C Fortran)
 find_package(ESMF 8.3.0 MODULE REQUIRED)
+if (NOT TARGET ESMF::ESMF)
+       add_library(ESMF::ESMF ALIAS esmf)
+endif ()
 if(FMS)
   find_package(FMS 2022.04 REQUIRED COMPONENTS R4 R8)
   if(APP MATCHES "^(HAFSW)$")

Will try to force static mapl linking tomorrow.

@climbfuji
Collaborator Author

@AlexanderRichert-NOAA Update. I rebuilt mapl as static libraries and linked against that version. With both MKL and openblas, the ufs-weather-model still aborts with the same error and in the same place as described above. I am certain this is unrelated to the changes in this PR; it must have something to do with the update of some packages from spack-stack-1.6.0 to spack-stack-develop.

@climbfuji climbfuji changed the title By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs accordingly; skip wgrib2 with Intel oneapi By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs as needed; skip wgrib2 with Intel oneapi Aug 14, 2024
@climbfuji climbfuji changed the title By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs as needed; skip wgrib2 with Intel oneapi By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs to revert to openblas/fftw as needed; skip wgrib2 with Intel oneapi; bump odc to 1.5.2 Aug 14, 2024
@climbfuji
Collaborator Author

@AlexanderRichert-NOAA Yet another update. I compiled the spack-stack develop unified environment, then ufs-weather-model and ran the same test as above. It failed in the same place. Therefore, the problem is not related to this PR and it shouldn't be held up by that.

@AlexanderRichert-NOAA
Collaborator

Have you run other UWM RTs with it?

@climbfuji
Collaborator Author

Have you run other UWM RTs with it?

I haven't.

@climbfuji
Collaborator Author

One thing I had to do in order to compile was:

index e5fdd1e8..5c7a974a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -148,6 +148,9 @@ endif()

 find_package(NetCDF 4.7.4 REQUIRED C Fortran)
 find_package(ESMF 8.3.0 MODULE REQUIRED)
+if (NOT TARGET ESMF::ESMF)
+       add_library(ESMF::ESMF ALIAS esmf)
+endif ()
 if(FMS)
   find_package(FMS 2022.04 REQUIRED COMPONENTS R4 R8)
   if(APP MATCHES "^(HAFSW)$")
diff --git a/tests/logs/RegressionTests_orion.log b/tests/logs/RegressionTests_orion.log
index b08d50a9..eee3eda5 100644

This is because of the findESMF mismatches etc. discussed elsewhere. I don't think this is the cause of the problem, though ...

I'll run one basic regression test next (not sure when I'll get to it).

@climbfuji
Collaborator Author

@AlexanderRichert-NOAA control_c48 runs with spack-stack-dev and with this branch. The openblas config (default as per this PR) produces results for control_c48 that are b4b (bit-for-bit) identical to those from spack-stack-dev. The mkl run is still stuck in the queue (orion had a power outage).

See ufs-community/ufs-weather-model#2399 where Dusan reports that he gets the same errors I had above with the coupled run when using esmf 8.6.1 and mapl 2.46.2 in spack-stack-1.6.0.

I think at this point there is no reason to hold up this PR.

@climbfuji
Collaborator Author

Thanks @AlexanderRichert-NOAA

@climbfuji climbfuji merged commit 7168fec into JCSDA:develop Aug 15, 2024
8 checks passed
@climbfuji climbfuji deleted the feature/oneapi_intel_use_mkl branch August 15, 2024 17:03
@climbfuji
Collaborator Author

@AlexanderRichert-NOAA I know this has been merged already, but for the sake of completeness: the control_c48 run with MKL is b4b identical to the openblas run. Of course, this may not be the case for the fully coupled model, but at least for the standalone atmosphere with some stochastics (cellular automata) it is.

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Switch to Intel MKL on systems with Intel compiler
6 participants