
GSI Reproducibility issues using hpc-stack NetCDF4 and HDF5 on WCOSS Dell machines #149

Closed
MichaelLueken opened this issue Jan 27, 2021 · 53 comments
Labels
bug Something isn't working


@MichaelLueken

Describe the bug
While running the GSI on Mars and Venus, the results from an executable built using hpc-stack modules don't reproduce those from an executable built using NCO's modules. However, if the hpc-stack executable is built with NCO's NetCDF4 and HDF5 modules, then the results reproduce those from the NCO-only executable.

To Reproduce
1) Clone NOAA-EMC:master as the control and then clone MichaelLueken-NOAA:master for the experiment.
2) Under the MichaelLueken-NOAA:master clone, update modulefiles/modulefile.ProdGSI.wcoss_d to use hpc-stack:

module load lsf/10.1
module use /usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-ips/18.0.1.163
module load hpc-impi/18.0.1
module load prod_util/1.2.2
module load bufr/11.4.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load bacio/2.4.1
module load crtm/2.3.0
module load cmake/3.10.0
module load python/3.6.3
module load netcdf/4.7.4
module load hdf5/1.10.6

The results from this test won't reproduce NOAA-EMC:master.

If you use:

module load lsf/10.1
module use /usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-ips/18.0.1.163
module load hpc-impi/18.0.1
module load prod_util/1.2.2
module load bufr/11.4.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load bacio/2.4.1
module load crtm/2.3.0
module load cmake/3.10.0
module load python/3.6.3
module load NetCDF-parallel/4.7.4
module load HDF5-parallel/1.10.6

This will reproduce NOAA-EMC:master.

Expected behavior
Executables built using NCO modules should reproduce those built using hpc-stack. Is the problem with hpc-stack, or is there an issue with the current NCO-built NetCDF-parallel and HDF5-parallel modules?

System:
This is an issue on WCOSS Dell machines. On Hera, NCEPLIBS-built modules reproduce hpc/1.0.0-beta1 modules, which also reproduce hpc/1.1.0 modules.

@MichaelLueken added the bug label on Jan 27, 2021
@Hang-Lei-NOAA (Contributor)

The hpc-stack installation of netcdf and its associated libraries uses the "--disable-shared" build option, i.e., a static installation.

However, the NCO installation did not use that option, so their build produces both shared and static libraries (from their config.log):
configure:15297: checking whether to build shared libraries
configure:15322: result: yes
configure:15325: checking whether to build static libraries
configure:15329: result: yes
else ifeq ($$(VALGRIND_ENABLED),yes)
enable_shared='yes'
enable_static='yes'
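A quick way to see what a given installation actually provides is to list the library files under its prefix (a sketch; NETCDF_PREFIX is just a placeholder for the install path being checked):

# NETCDF_PREFIX is a placeholder for the netCDF install prefix being checked
ls $NETCDF_PREFIX/lib/libnetcdf*
# libnetcdf.a present  -> static library installed
# libnetcdf.so present -> shared library installed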

@edwardhartnett (Contributor)

Are you looking at the executables to see if they are identical? Or are you running them to see if they produce identical output?

@MichaelLueken (Author)

I'm running them to see if they produce identical output.

@kgerheiser (Contributor)

@Hang-Lei-NOAA do you think the difference is because of static vs. shared libraries? I'm not sure that would account for a difference in results.

How can we see how HDF5/NetCDF was built for NCO? To start with we can compare the build options.

@Hang-Lei-NOAA (Contributor)

You can go to their source directory and compare the config.log file in their src with our build settings.

@edwardhartnett (Contributor)

Netcdf also leaves a libnetcdf.settings file, which summarizes the build.
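For example (a sketch; NETCDF_PREFIX is a placeholder for the install prefix being inspected):

# Read the build summary netCDF leaves behind and pull out the shared/static settings
grep -iE 'shared|static|extra' $NETCDF_PREFIX/lib/libnetcdf.settings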

@kgerheiser (Contributor)

The build settings look identical between the hpc-stack and NCO builds, except that NCO builds shared libraries.

I built a shared-library HDF5/NetCDF hpc-stack installation at:

module use /gpfs/dell2/emc/modeling/noscrub/Kyle.Gerheiser/hpc-stack/netcdf-install/modulefiles/stack

@MichaelLueken-NOAA could you try your test using those modules? I'm not confident that's the problem, but it's worth checking.

@Hang-Lei-NOAA (Contributor) commented Jan 27, 2021 via email

@kgerheiser (Contributor)

@Hang-Lei-NOAA did it work? Was that the reason for the difference?

@Hang-Lei-NOAA (Contributor) commented Jan 27, 2021 via email

@MichaelLueken (Author)

@kgerheiser The tests using your shared-library HDF5/NetCDF hpc-stack build reproduce the results from NCO's modules. So, it seems as though the issue is with the shared vs. static HDF5/NetCDF libraries.

@Hang-Lei-NOAA (Contributor) commented Jan 28, 2021 via email

@edwardhartnett (Contributor)

That's not good, since there should be no output difference between a shared and static build...

@kgerheiser (Contributor)

The exact changes I made were to make zlib, HDF5, and NetCDF shared libraries, and to remove the -fPIC flags in those libraries. My thought is that some optimization done by the compiler is different when using static or shared libraries.
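For reference, these are the underlying autoconf switches involved for HDF5 and NetCDF (a sketch of the generic configure options, not the exact hpc-stack build commands):

# Shared build (position-independent code is implied for the shared objects)
./configure --enable-shared --disable-static
# Static build that still compiles objects with -fPIC
./configure --disable-shared --enable-static CFLAGS="-fPIC" FCFLAGS="-fPIC"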

@Hang-Lei-NOAA (Contributor) commented Jan 28, 2021 via email

@kgerheiser (Contributor)

I'm fine with having -fPIC. It was just something to test, because it can change the optimizations the compiler can perform, which could affect the results.

@RussTreadon-NOAA

What's the path forward from here? Is the difference due to shared -vs- static, the presence -vs- absence of -fPIC, or both? As a test, can we create hpc-stack with

  • zlib, HDF5, and NetCDF as shared libraries and include the -fPIC flag?
  • zlib, HDF5, and NetCDF as static libraries and remove the -fPIC flag?

Which test makes more sense to run? Are there different and better tests to run?

@Hang-Lei-NOAA (Contributor) commented Jan 28, 2021 via email

@RussTreadon-NOAA

OK, can you create hpc-stack with the first configuration

  • zlib, HDF5, and NetCDF as shared libraries and include the -fPIC flag

@kgerheiser (Contributor)

Yes, I'll do that.

@Hang-Lei-NOAA (Contributor) commented Jan 28, 2021 via email

@MichaelLueken (Author)

@Hang-Lei-NOAA The hpc-stack test version of HDF5/NetCDF also reproduces NCO's HDF5/NetCDF libraries.

@RussTreadon-NOAA

Thanks, Mike. DA regression tests on Venus are yielding the same result. MichaelLueken-NOAA:master built with hpc-stack test module hpc/1.0.0-beta1 is reproducing regression test results from NOAA-EMC:master built with NCO GFS v16 modules.

@Hang-Lei-NOAA (Contributor) commented Jan 28, 2021 via email

@RussTreadon-NOAA

When will hpc-stack test hpc/1.0.0-beta be promoted to a non-beta release?

@kgerheiser (Contributor)

The 1.0.0-beta1 release is a version behind. We are currently on version 1.1.0.

@RussTreadon-NOAA

So 1.0.0-beta1 is a non-starter for moving forward. My bad. These are the tests we've run thus far:

  • with hpc/1.0.0-beta DA apps can reproduce what will be implemented with GFS v16
  • with hpc/1.1.0 DA apps do not reproduce what will be implemented with GFS v16
  • a test version of hpc/1.1.0 with shared libraries and -fPIC removed reproduces what will be implemented with GFS v16

Assuming the above is a correct summary, I guess we're back to a test of hpc/1.1.0 with zlib, HDF5, and NetCDF as shared libraries with the -fPIC flag included. Does this sound reasonable?

@edwardhartnett (Contributor)

Seemingly we have different results with shared vs. static builds. I don't think that Kyle and Hang can debug this problem in your code.

We know that we have installed the libraries correctly, and they have passed their tests. It's not clear what else we can do.

The best approach would be some unit tests in the GSI package to isolate the problem. But this is beyond the scope of the NCEPLIBS team.

@kgerheiser (Contributor) commented Jan 28, 2021

I updated my hpc-stack test build with both shared and static libraries (with -fPIC) for HDF5 and NetCDF.

module use /gpfs/dell2/emc/modeling/noscrub/Kyle.Gerheiser/hpc-stack/netcdf-install/modulefiles/stack

module load netcdf/4.7.4-shared (loads shared)
module load netcdf/4.7.4-static (loads static)

Same thing for HDF5

@MichaelLueken (Author)

@kgerheiser The GSI fails to compile with the new shared and static HDF5 and NetCDF libraries. For the static build, the following errors cause the compilation failure:

ld: cannot find -lnetcdff
ld: cannot find -lhdf5_hl
ld: cannot find -lhdf5
ld: cannot find -lnetcdf
ld: cannot find -lhdf5_hl
ld: cannot find -lhdf5

Also, while CMake is configuring, there are messages:

CMake Warning at src/ncdiag/CMakeLists.txt:19 (add_executable):
Cannot generate a safe runtime search path for target ncdiag_cat_mpi.x
because files in some directories may conflict with libraries in implicit
directories:

runtime library [libz.so.1] in /usr/lib64 may be hidden by files in:
  /gpfs/dell2/emc/modeling/noscrub/Kyle.Gerheiser/hpc-stack/netcdf-install/ips-18.0.1.163/zlib/1.2.11/lib

Some of these libraries may not be found correctly.

For the shared libraries:

ld: cannot find -lnetcdff
ld: cannot find -lnetcdf

and the same CMake Warning messages as the static build:

CMake Warning at src/ncdiag/CMakeLists.txt:19 (add_executable):
Cannot generate a safe runtime search path for target ncdiag_cat_mpi.x
because files in some directories may conflict with libraries in implicit
directories:

runtime library [libz.so.1] in /usr/lib64 may be hidden by files in:
  /gpfs/dell2/emc/modeling/noscrub/Kyle.Gerheiser/hpc-stack/netcdf-install/ips-18.0.1.163/zlib/1.2.11/lib

Some of these libraries may not be found correctly.

The GSI uses CMake to build executables. While attempting to load hpc-stack modules, it was discovered that the current FindNetCDF.cmake module wasn't working, so the FindNetCDF.cmake module from the NOAA-EMC/CMakeModules project was brought in. Is it possible that the issue is being caused by the new FindNetCDF.cmake module file? The current FindNetCDF.cmake module file in the NOAA-EMC/GSI project has no search criteria for static vs. shared libraries, but the new module file does.

@MichaelLueken (Author)

Out of curiosity, what changes were made between the hpc-stack-built NetCDF and HDF5 libraries and the NCO (WCOSS_D) and NCEPLIBS (Hera) builds that require the use of a new CMake module? If the GSI's original FindNetCDF.cmake module could work with the hpc-stack-built libraries, everything might work without an issue.

@kgerheiser (Contributor)

The linking problem is my fault. I thought I would be able to build both static and shared and just change the module/install path. Looks like that doesn't work.

What do you mean about a new CMake module?

@MichaelLueken (Author)

The GSI fails to build using hpc-stack, unless the FindNetCDF.cmake module in:

https://github.com/NOAA-EMC/GSI/blob/master/cmake/Modules/FindNetCDF.cmake

is replaced with the FindNetCDF.cmake module in:

https://github.com/NOAA-EMC/CMakeModules/blob/develop/Modules/FindNetCDF.cmake

So, since a new version of the FindNetCDF.cmake module is required to build the GSI using hpc-stack, it would appear that the new version might be the culprit behind the weird reproducibility issue between static and shared HDF5 and NetCDF libraries.

@edwardhartnett (Contributor)

Do the two different FindNetCDF files find different versions of the library? Or do they set some other options that would explain the difference in output? Because just shared vs. static should not produce any changes in output. Both netCDF and HDF5 have thousands of tests, which run in both static and shared builds. If there was a difference in output, we would know about it.

@aerorahul (Contributor)

The FindNetCDF.cmake in the GSI should be replaced with the one in CMakeModules.
The latter uses nc-config, nf-config, and ncxx4-config to determine the libraries and their dependencies. The GSI FindNetCDF.cmake will not return the HDF5 and zlib libraries needed to link with the netcdf library.
The CMakeModules version also supports a user request for the STATIC or SHARED version of the library; the GSI one does not.
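For example, with a netcdf module loaded, these helpers report the link lines that FindNetCDF.cmake picks up (a sketch):

nc-config --libs      # C library link line (for static builds this includes HDF5, zlib, -lm)
nf-config --flibs     # Fortran library link line
nc-config --all       # full configuration summary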

@aerorahul (Contributor)

The error reported in comment #149 (comment) is because the GSI's FindNetCDF.cmake does not return where the dependencies are.

@MichaelLueken (Author)

@aerorahul The error you noted was encountered while using the CMakeModules FindNetCDF.cmake. As a matter of fact, the entire reproducibility issue is due to the CMakeModules FindNetCDF.cmake module. For static libraries, this FindNetCDF.cmake module file adds extra "-lm" entries during the dynamic linking of the final executables. The use of different math libraries is leading to non-reproducible results.

Many thanks go to @kgerheiser, @Hang-Lei-NOAA, and @edwardhartnett for the assistance that you have given to Russ and myself.

@aerorahul (Contributor)

Ok. So this is not a hpc-stack issue.

@RussTreadon-NOAA

build/src/gsi/CMakeFiles/global_gsi.x.dir/load.txt contains

../../lib/libgsilib_shrd.a -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/lib -lnetcdff -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/lib -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/lib -lhdf5_hl -lhdf5 -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/zlib/1.2.11/lib -lz -ldl -lm -lnetcdf -lhdf5_hl -lhdf5 -lm -lz

Note the two -lm. Mark Potts suggested manually removing the -lm and rerunning link.txt. This was done for the global_gsi.x and global_enkf.x load.txt. The DA regression tests were resubmitted using the NOAA-EMC/GSI master as the control and the above modified hpc/1.1.0 DA build as the update.

The regression tests are still running on Mars. 8 of the 19 have completed. All 8 have passed. 1 of the 8 is global_enkf.x. The other 7 are global_gsi.x. I'll let the remaining tests run.

What changes are needed in the DA cmake file(s) so that -lm is NOT included in link.txt?
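For reference, the manual removal of -lm described above amounts to something like the following (a sketch; it assumes link.txt/load.txt is the CMake-generated link command file and that the executable lands in bin/):

cd build
sed -i 's/ -lm / /g' src/gsi/CMakeFiles/global_gsi.x.dir/link.txt   # drop the explicit -lm entries
rm -f bin/global_gsi.x                                              # force the link step to run again
make global_gsi.x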

@kgerheiser (Contributor)

Ah, that would explain it.

Could you run ldd on the executables and see which exact math library is being linked in that differs?

I think -lm comes from nc-config which is used internally in FindNetCDF.cmake.
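That can be checked directly (a sketch):

# Count how many times -lm appears in the link line reported by nc-config
nc-config --libs | tr ' ' '\n' | grep -cx -- '-lm'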

@RussTreadon-NOAA

I'm not sure what to look for. I see libm and libmkl listed in "ldd global_gsi.x" from both builds.

The hpc/1.1.0 build with -lm found twice in load.txt has libm.so.6 listed as the third library. libmkl_intel_lp64.so, libmkl_intel_thread.so, libmkl_core.so, and libiomp5.so are listed 8th to 11th.

The re-link without -lm has libm.so.6 as the 11th entry after libmkl_intel_lp64.so, libmkl_intel_thread.so, libmkl_core.so, and libiomp5.so which are the 7th to 10th entries, respectively.

Does the order in which libraries are listed matter?

@kgerheiser (Contributor) commented Jan 29, 2021

You would want to look at whether the shared libraries each executable links to at runtime are the same, e.g., /lib64/libm.so. Which shared libraries are used also depends on the current environment, so when you run ldd you want the same modules loaded as when the executable runs.

And that is also why we avoid shared libraries.
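For example (a sketch; the executable path is illustrative):

# With the same modules loaded as at run time, compare which math/MKL libraries resolve
ldd ./global_gsi.x | grep -E 'libm\.|libmkl'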

@RussTreadon-NOAA

Applied Mark Potts' suggestion in reverse to the WCOSS_D build of NOAA-EMC/GSI master:

  • add -lm to master/build/src/gsi/CMakeFiles/global_gsi.x.dir/load.txt
  • execute load.txt
  • run global_gsi.x in v16 based rungsi script
  • run global_gsi.x built from GSI PR #107 using hpc/1.1.0 on WCOSS_D using same rungsi script
  • two GSI analyses are identical

Thus, there are two ways to get identical results:

  1. remove -lm from the hpc/1.1.0 GSI build; the analysis is identical to the master GSI build
  2. add -lm to the master GSI build; the analysis is identical to the hpc/1.1.0 GSI build

Interestingly, on Hera the master build without -lm is identical to the hpc/1.1.0 build with -lm. Why? It's the same GSI repo built with the same CMake files. The Hera build uses Intel compiler 18.0.5.274. The WCOSS_D build uses 18.0.1.163. Does this explain the difference? We also need to look into the details of the Hera -vs- WCOSS_D GSI builds. Different compiler options?

Going back to WCOSS_D, diff of netcdf/4.7.4 nc-config between NCO and hpc/1.1.0 returns

[Russ.Treadon@m71a1 ~]$ diff /usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4/bin/nc-config /gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/bin/nc-config
7c7
< prefix=/usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4
---
> prefix=/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4
13,14c13,14
< cflags="-I${includedir} -I/usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4/include"
< libs="-L${libdir} -lnetcdf"
---
> cflags="-I${includedir}  -I/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/include"
> libs="-L${libdir} -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/lib -lhdf5_hl -lhdf5   -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/zlib/1.2.11/lib  -lz -ldl -lm   -lnetcdf -lhdf5_hl -lhdf5 -lm -lz "

Why do the NCO libs not include -lm? Why do the hpc/1.1.0 libs include -lm twice?

I'm not arguing that hpc/1.1.0 is wrong. I'm trying to understand why NCO netcdf/4.7.4 is built as it is. Is NCO's build correct?

@RussTreadon-NOAA

The Hera GSI build specifies -fp-model source whereas the WCOSS_D build specifies -fp-model strict. Rebuild GSI master and pr107 on WCOSS_D with -fp-model source. Now the GSI master built with NCO modules and GSI pr107 built with hpc/1.1.0 yield identical analyses for 14 of 19 DA regression tests running on WCOSS_D. 5 tests are still running.

Interestingly, the Hera and WCOSS_D EnKF builds both specify -fp-model strict. I'll have to dig through issues for reasons why Hera GSI build uses -fp-model source.

@RussTreadon-NOAA

> The Hera GSI build specifies -fp-model source whereas the WCOSS_D build specifies -fp-model strict. Rebuild GSI master and pr107 on WCOSS_D with -fp-model source. Now the GSI master built with NCO modules and GSI pr107 built with hpc/1.1.0 yield identical analyses for 14 of 19 DA regression tests running on WCOSS_D. 5 tests are still running.
>
> Interestingly, the Hera and WCOSS_D EnKF builds both specify -fp-model strict. I'll have to dig through issues for reasons why Hera GSI build uses -fp-model source.

The above statement is WRONG. I was not testing the configurations I thought I was testing.

Mike merged NOAA-EMC/GSI PR#107 into the NOAA-EMC/GSI master at ade4c6b. This merge brought hpc/1.1.0 into the modulefile used for the Hera build. It also updated the cmake files used for builds on all platforms. Given this, the following was done to build NOAA-EMC/GSI master on Venus using hpc/1.1.0:

  • clone NOAA-EMC/GSI master
  • update modulefile.ProdGSI.wcoss_d to use hpc/1.1.0
  • build global_gsi.x using hpc/1.1.0

The above was repeated for NOAA-EMC/GSI master using NCO's production v16 modules:

  • clone NOAA-EMC/GSI master
  • leave modulefile.ProdGSI.wcoss_d alone (use NCO modules)
  • build global_gsi.x using NCO modules

Both executables were used to run the 2021020112 gdas case on Venus. The resulting global_gsi.x analyses differ. The differences show up in the initial total penalty terms. Of the printed 17 digits, the first 14 are identical between the two runs; the last 3 digits differ. By the start of the second outer loop, only 8 of the 17 digits of the total penalties are identical. As a result, the final analyses are not bit-for-bit (b4b) identical.

The above builds were modified by replacing -fp-model strict with -fp-model source. The 2021020112 case was rerun for each recompiled executable. The two analyses still differ from one another, and each is identical to its -fp-model strict counterpart. Using strict or source for -fp-model does not alter the GSI analysis.

It was noted that when building global_gsi.x with hpc/1.1.0, -lm is added to the GSI link step. When building global_gsi.x with NCO's modules, -lm is not present in the GSI link step.

The following combination of builds creates global_gsi.x executables that generate identical analyses:

  • build with NCO modules and add -lm to the GSI link step
  • build with hpc/1.1.0 with no changes

Another way to get identical global_gsi.x analyses is to build as follows:

  • build with NCO modules with no changes
  • build with hpc/1.1.0 with -lm removed from the GSI link step

Why does building with hpc/1.1.0 add -lm to the link step?

The NCO v16 netcdf and hdf5 modules are NetCDF-parallel/4.7.4 and HDF5-parallel/1.10.6. These were loaded and nc-config and nf-config were executed with the option --all to capture the NetCDF NCO configurations. The hpc/1.1.0 netcdf and hdf5 modules are netcdf/4.7.4 and hdf5/1.10.6. These were loaded and nc-config and nf-config were executed with the option --all to capture the NetCDF hpc/1.1.0 configurations.

Ignoring the path differences, nf-config shows the following differences

NCO
--cflags -> -I/gpfs/dell1/usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4/include -I/gpfs/dell1/usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4/include

hpc/1.1.0
--cflags -> -I/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/include -I/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/include

hpc/1.1.0 cflags lists netcdf and hdf5. NCO cflags repeats netcdf twice. Why the repetition?

Ignoring the path differences, nc-config shows more differences

NCO

  • cflags and cxx4flags list netcdf twice
  • libs only lists netcdf -lnetcdf
  • cxx4 is g++
  • cxx4flags lists netcdf twice
  • cxx4libs lists netcdf -lnetcdf_c++4 -lnetcdf

hpc/1.1.0

  • cflags and cxx4flags list netcdf and hdf5
  • libs lists hdf5 -lhdf5_hl -lhdf5, zlib -lz -ldl -lm, and netcdf -lnetcdf
  • cxx4 is mpiicpc
  • cxx4libs lists hdf5 -lhdf5_hl -lhdf5, zlib -lz -ldl -lm, netcdf -lnetcdf -lnetcdf_c++4
  • cxx4flags lists netcdf and hdf5

I do not know what all these settings mean. I also don't know which set of configuration settings is correct.

Which installation of netcdf/4.7.4 and hdf5/1.10.6 is correct?

I assume hpc/1.1.0 is correct, but the to-be-implemented GFS v16 package uses NCO's installation.

@kgerheiser (Contributor) commented Feb 4, 2021

I don't know why they would use g++ as their C++ compiler. CXX might not have been set and it picks up g++ as the default?

Does GSI use NetCDF C++?

I'm looking closely at the C++ flags. I can tell where each part is coming from, but I'm not sure why flags are being repeated. Something might be off there.

Like -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/zlib/1.2.11/lib -lz -ldl -lm is part of HDF5's Extra Libraries, but it's listed twice.

However, that doesn't ultimately explain why the linking order makes for different results. Even with the repeats it looks like the correct libraries would be picked up.

NCO repeats the flags in CFLAGS because they install HDF5 and NetCDF into the same directory. It's pointing to both, just like hpc-stack, but they're in the same place.

And libs is different because for shared libraries you don't need to manually link in the dependencies, so -L<NetCDF> -lnetcdff is all you need. HDF5, etc. are dynamically linked.

I'm thinking a different math library gets picked up somehow. Even without the -lm a math library is being linked in somewhere.

@RussTreadon-NOAA

Thanks, Kyle, for the reply.

I'm not sure if the following answers your question, "Does GSI use NetCDF C++?".

I did not add the netCDF interface to the GSI. Nor did I add the cmake build system, but here is what I see & did:

src/gsi/CMakeLists.txt includes both ${NETCDF_Fortran_LIBRARIES} and ${NETCDF_C_LIBRARIES} as target_link_libraries. I removed ${NETCDF_C_LIBRARIES} and the build failed.

Have any other developers reported different results when building their apps with NCO/IBM netcdf -vs- hpc/1.1.0 netcdf on WCOSS_D?

@aerorahul (Contributor)

@kgerheiser @RussTreadon-NOAA
GSI does not use the C++ NetCDF API. It uses the Fortran API. The Fortran API depends on the C library.

@RussTreadon-NOAA

Rahul suggested some changes which I'm incrementally making in a working copy of the NOAA-EMC/GSI master. Now global_gsi.x built from NCO/IBM modules and hpc/1.1.0 modules yield identical initial penalties. Results differ in the minimization.

@RussTreadon-NOAA

I tinkered with src/gsi/CMakeLists.txt in a working copy of the NOAA-EMC/GSI master. As noted before, adding -lm to target_link_libraries allows a global_gsi.x built using NCO/IBM modules to reproduce output from a global_gsi.x built with hpc/1.1.0.

I ran a series of tests in which -lm is inserted at various locations on the target_link_libraries line for the NCO/IBM module build of global_gsi.x and found that global_gsi.x results vary based on the placement of -lm. Adding -lm before the variable $CORE_LIBRARIES creates a global_gsi.x that generates results identical to the global_gsi.x built from hpc/1.1.0. Moving -lm immediately after $CORE_LIBRARIES creates a global_gsi.x which reproduces the result from the unaltered NCO/IBM build. $CORE_LIBRARIES is the following list of production libraries:

CORE_LIBRARIES set to /gpfs/dell1/nco/ops/nwprod/lib/bacio/v2.0.3/ips/18.0.1/libbacio_v2.0.3_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/bufr/v11.3.0/ips/18.0.1/libbufr_v11.3.0_d_64.a;
/gpfs/dell1/nco/ops/nwprod/lib/sigio/v2.1.0/ips/18.0.1/libsigio_v2.1.0_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/nemsio/v2.2.4/ips/18.0.1/impi/18.0.1/libnemsio_v2.2.4.a;
/gpfs/dell1/nco/ops/nwprod/lib/crtm/v2.3.0/ips/18.0.1/libcrtm_v2.3.0.a;
/gpfs/dell1/nco/ops/nwprod/lib/sp/v2.0.3/ips/18.0.1/libsp_v2.0.3_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/sfcio/v1.0.0/ips/18.0.1/libsfcio_v1.0.0_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/w3emc/v2.4.0/ips/18.0.1/impi/18.0.1/libw3emc_v2.4.0_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/w3nco/v2.2.0/ips/18.0.1/libw3nco_v2.2.0_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/ip/v3.0.2/ips/18.0.1/libip_v3.0.2_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/bacio/v2.0.3/ips/18.0.1/libbacio_v2.0.3_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/bufr/v11.3.0/ips/18.0.1/libbufr_v11.3.0_d_64.a;
/gpfs/dell1/nco/ops/nwprod/lib/sigio/v2.1.0/ips/18.0.1/libsigio_v2.1.0_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/nemsio/v2.2.4/ips/18.0.1/impi/18.0.1/libnemsio_v2.2.4.a;
/gpfs/dell1/nco/ops/nwprod/lib/crtm/v2.3.0/ips/18.0.1/libcrtm_v2.3.0.a;
/gpfs/dell1/nco/ops/nwprod/lib/sp/v2.0.3/ips/18.0.1/libsp_v2.0.3_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/sfcio/v1.0.0/ips/18.0.1/libsfcio_v1.0.0_4.a;
/gpfs/dell1/nco/ops/nwprod/lib/w3emc/v2.4.0/ips/18.0.1/impi/18.0.1/libw3emc_v2.4.0_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/w3nco/v2.2.0/ips/18.0.1/libw3nco_v2.2.0_d.a;
/gpfs/dell1/nco/ops/nwprod/lib/ip/v3.0.2/ips/18.0.1/libip_v3.0.2_d.a

I need to figure out why the libraries are listed twice in $CORE_LIBRARIES.

How does -lm interact with the above libraries?
Has anyone else observed sensitivity of executable results to link order on WCOSS_D?

@RussTreadon-NOAA

One final test: repeat the above exercise with global_gsi.x built using hpc/1.1.0 modules. With hpc/1.1.0, -lm is included in $NETCDF_C_LIBRARIES:

NETCDF_C_LIBRARIES=-L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/lib 
-L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/lib -lhdf5_hl -lhdf5 
-L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/zlib/1.2.11/lib -lz -ldl -lm -lnetcdf -lhdf5_hl -lhdf5 -lm -lz

Build hpc/1.1.0 global_gsi.x with $NETCDF_C_LIBRARIES before $CORE_LIBRARIES. global_gsi.x reproduces results from original hpc/1.1.0 global_gsi.x executable.

Move $NETCDF_C_LIBRARIES after $CORE_LIBRARIES, recompile, and rerun. global_gsi.x reproduces results from original NCO/IBM global_gsi.x executable.

Tests indicate that -lm interacts with one or more libraries in $CORE_LIBRARIES in such a way that global_gsi.x results are altered on WCOSS_D.

@edwardhartnett (Contributor)

Probably what is happening is that one of the CORE_LIBRARIES also defines a math function that is in libm. So the order of libraries will determine which function gets used. If the functions are slightly different, this could explain the difference in output.
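One way to check this hypothesis (a sketch; the library path is taken from the $CORE_LIBRARIES list above and the symbol names are only examples):

# Look for libm math functions that are defined (T) rather than merely referenced (U)
# in one of the static production libraries, e.g. sp
nm -A /gpfs/dell1/nco/ops/nwprod/lib/sp/v2.0.3/ips/18.0.1/libsp_v2.0.3_d.a | grep -E ' T (pow|exp|log|sqrt)$'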

@RussTreadon-NOAA

Thank you, Ed, for your reply. I agree, though I do not know which library in $CORE_LIBRARIES might be defining a math function also found in libm. Maybe it's sp. I won't dig into this at present.

It would be useful to note in the hpc-stack wiki or a readme file that the hpc/1.1.0 module netcdf/4.7.4 defines $NETCDF_C_LIBRARIES as

-L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/netcdf/4.7.4/lib -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/impi-18.0.1/hdf5/1.10.6/lib -lhdf5_hl -lhdf5 -L/gpfs/dell2/usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/ips-18.0.1.163/zlib/1.2.11/lib -lz -ldl -lm -lnetcdf -lhdf5_hl -lhdf5 -lm -lz

whereas the NCO/IBM GFS v16 module NetCDF-parallel/4.7.4 defines $NETCDF_C_LIBRARIES as

-L/gpfs/dell1/usrx/local/prod/packages/ips/18.0.1/impi/18.0.1/netcdf/4.7.4/lib -lnetcdf

The hpc/1.1.0 module adds -lm and other libraries to $NETCDF_C_LIBRARIES. This difference can impact results generated by source code built from hpc/1.1.0 -versus- the NCO/IBM GFS v16 modules.

I think this issue may be closed.
