GSI Reproducibility issues using hpc-stack NetCDF4 and HDF5 on WCOSS Dell machines #149
Comments
The hpc-stack installation of netcdf and its associated libraries uses the "--disable-shared" build option for a static installation. However, NCO did not use that option, so their installation also builds shared libraries. |
Are you looking at the executables to see if they are identical? Or are you running them to see if they produce identical output? |
I'm running them to see if they produce identical output. |
@Hang-Lei-NOAA do you think the difference is because of static vs shared libraries? I'm not sure that would account for a difference in results. How can we see how HDF5/NetCDF was built for NCO? To start with we can compare the build options. |
You can go to their source directory and compare the config.log file in their src with our build settings. |
Netcdf also leaves a libnetcdf.settings file, which summarizes the build. |
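For anyone who wants to do that comparison directly, a minimal sketch (the install prefixes below are placeholders, not the actual NCO or hpc-stack paths):

```sh
# Placeholder prefixes; substitute the real NCO and hpc-stack netcdf install roots.
NCO_NETCDF=/path/to/nco/netcdf/4.7.4
HPC_NETCDF=/path/to/hpc-stack/netcdf/4.7.4

# libnetcdf.settings summarizes the configure options each build used.
diff "$NCO_NETCDF/lib/libnetcdf.settings" "$HPC_NETCDF/lib/libnetcdf.settings"

# With a given netcdf module loaded, nc-config --all prints the same information.
nc-config --all
```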
The build settings look identical between the hpc-stack and NCO builds, except that NCO builds shared libraries. I built a shared-library HDF5/NetCDF hpc-stack build at:
module use /gpfs/dell2/emc/modeling/noscrub/Kyle.Gerheiser/hpc-stack/netcdf-install/modulefiles/stack
@MichaelLueken-NOAA could you try your test using those modules? I'm not confident that's the problem, but it's worth checking. |
I also made the testing version with the same settings as NCO's. You can test Kyle's installation first. If the issue continues, you can try the testing versions. The links are on the GitHub wiki. Let us know your results. Thanks.
|
@Hang-Lei-NOAA did it work? Was that the reason for the difference? |
As you compared, that is the difference. Let’s wait for their test results.
|
@kgerheiser The tests using your shared-library HDF5/NetCDF hpc-stack build reproduce the results from NCO's modules. So, it seems as though the issue is with the shared vs static HDF5/NetCDF libraries. |
Good. That is it.
|
That's not good, since there should be no output difference between a shared and static build... |
The exact changes I made were to make zlib, HDF5, and NetCDF shared libraries, and to remove the -fPIC flags in those libraries. My thought is that some optimization done by the compiler is different when using static or shared libraries. |
-fPIC should be kept anyway; this is a change made two years ago for accessing CCPP.
The test version is a very close match of all settings in the NCO-installed libraries.
|
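For reference, the usual distinction between the two kinds of builds, shown with a generic gcc example (foo.c is a stand-in source file; this is an illustration, not the hpc-stack build scripts):

```sh
# Static archive: position-independent code is optional.
gcc -O2 -c foo.c -o foo.o
ar rcs libfoo.a foo.o

# Shared library: objects must be compiled with -fPIC before linking with -shared.
gcc -O2 -fPIC -c foo.c -o foo_pic.o
gcc -shared -o libfoo.so foo_pic.o
```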
I'm fine having -fPIC. |
What's the path forward from here? Is the difference due to shared -vs- static or the presence -vs- absence of -fPIC or both? As a test can we create hpc-stack with
- zlib, HDF5, and NetCDF as shared libraries and include the -fPIC flag?
- zlib, HDF5, and NetCDF as static libraries and remove the -fPIC flag?
Which test makes more sense to run? Are there different and better tests to run? |
-fPIC was added two years before hpc-stack existed.
|
OK, can you create those two test builds? |
Yes, I'll do that. |
The existing test version is here: https://github.com/NOAA-EMC/hpc-stack/wiki/HPC-stack-test-installations
You can try this first and see if it works.
|
@Hang-Lei-NOAA The hpc-stack test version of HDF5/NetCDF also reproduces NCO's HDF5/NetCDF libraries. |
Thanks, Mike. DA regression tests on Venus are yielding the same result. MichaelLueken-NOAA:master built with hpc-stack test module hpc/1.0.0-beta1 is reproducing regression test results from NOAA-EMC:master built with NCO GFS v16 modules. |
The changes in this version follow Fanglin's request to match NCO operational use.
|
When will |
The |
So
Assuming the above is a correct summary, I guess we're back to test of |
Seemingly we have different results with shared vs. static builds. I don't think that Kyle and Hang can debug this problem in your code. We know that we have installed the libraries correctly, and they have passed their tests. It's not clear what else we can do. The best approach would be some unit tests in the GSI package to isolate the problem. But this is beyond the scope of the NCEPLIBS team. |
I updated my hpc-stack test build with both shared and static libraries (with -fPIC) for HDF5 and NetCDF.
Same thing for HDF5 |
@kgerheiser The GSI fails to compile with the new shared and static HDF5 and NetCDF libraries. For static, the following causes the compilation failure:
ld: cannot find -lnetcdff
Also, while CMake is configuring, there are messages:
CMake Warning at src/ncdiag/CMakeLists.txt:19 (add_executable): Some of these libraries may not be found correctly.
For the shared libraries, the same ld: cannot find -lnetcdff error appears, along with the same CMake Warning messages as in the static build.
The GSI uses CMake to build executables. While attempting to load hpc-stack modules, it was discovered that the current FindNetCDF.cmake module wasn't working, so the FindNetCDF.cmake module from the NOAA-EMC/CMakeModules project was brought in. Is it possible that the issue is being caused by the new FindNetCDF.cmake module file? The current FindNetCDF.cmake module file in the NOAA-EMC/GSI project has no search criteria for static vs shared libraries, but the new module file does. |
Out of curiosity, what changes were made between the hpc-stack built NetCDF and HDF5 libraries vs NCO (WCOSS_D) and NCEPLIBS (Hera) that require the use of a new CMake module? If the GSI's original FindNetCDF.cmake module could work with the hpc-stack built libraries, everything might work without an issue. |
The linking problem is my fault. I thought I would be able to build both static and shared and just change the module/install path. Looks like that doesn't work. What do you mean about a new CMake module? |
The GSI fails to build using hpc-stack, unless the FindNetCDF.cmake module in: https://github.com/NOAA-EMC/GSI/blob/master/cmake/Modules/FindNetCDF.cmake is replaced with the FindNetCDF.cmake module in: https://github.com/NOAA-EMC/CMakeModules/blob/develop/Modules/FindNetCDF.cmake So, since a new version of the FindNetCDF.cmake module is required to build the GSI using hpc-stack, it would appear that the new version might be the culprit behind the weird reproducibility issue between static and shared HDF5 and NetCDF libraries. |
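One way to narrow that down is to diff the two module files directly; a sketch assuming local clones of both repositories (the clone paths are illustrative):

```sh
# Compare the GSI's FindNetCDF.cmake with the NOAA-EMC/CMakeModules version.
diff GSI/cmake/Modules/FindNetCDF.cmake CMakeModules/Modules/FindNetCDF.cmake

# Look specifically for static-vs-shared search logic in each file.
grep -n -iE 'static|shared|CMAKE_FIND_LIBRARY_SUFFIXES' \
    GSI/cmake/Modules/FindNetCDF.cmake \
    CMakeModules/Modules/FindNetCDF.cmake
```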
Do the two different FindNetCDF files find different versions of the library? Or do they set some other options that would explain the difference in output? Because just shared vs. static should not produce any changes in output. Both netCDF and HDF5 have thousands of tests, which run in both static and shared builds. If there was a difference in output, we would know about it. |
The |
The error reported in comment #149 (comment) is because the GSI's |
@aerorahul The error you noted was using the CMakeModules FindNetCDF.cmake. As a matter of fact, the entire reproducibility issue is due to the CMakeModules FindNetCDF.cmake module. For static libraries, this FindNetCDF.cmake module file adds extra "-lm" entries during the dynamic linking of the final executables. The use of different math libraries is leading to non-reproducible results. Many thanks go to @kgerheiser, @Hang-Lei-NOAA, and @edwardhartnett for the assistance that you have given to Russ and me. |
Ok. So this is not a hpc-stack issue. |
build/src/gsi/CMakeFiles/global_gsi.x.dir/load.txt contains
Note the two -lm entries. The regression tests are still running on Mars. 8 of the 19 have completed. All 8 have passed. 1 of the 8 is global_enkf.x. The other 7 are global_gsi.x. I'll let the remaining tests run. What changes are needed in the DA cmake file(s) so that |
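To confirm how many times -lm appears and where it sits relative to MKL, the link command saved by CMake can be inspected directly (the path is taken from the comment above; the exact file name may differ between CMake versions):

```sh
LINKFILE=build/src/gsi/CMakeFiles/global_gsi.x.dir/load.txt

# Count the -lm occurrences on the final link line.
grep -o -- '-lm\b' "$LINKFILE" | wc -l

# Show where -lm falls relative to the MKL libraries in the link order.
tr ' ' '\n' < "$LINKFILE" | grep -n -E '^-lm$|mkl'
```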
Ah, that would explain it. Could you run ldd on both executables and compare? I think |
I'm not sure what to look for. I see libm and libmkl listed in "ldd global_gsi.x" from both builds. The hpc/1.1.0 build with -lm found twice in load.txt has libm.so.6 listed as the third library. libmkl_intel_lp64.so, libmkl_intel_thread.so, libmkl_core.so, and libiomp5.so are listed 8th to 11th. The re-link without -lm has libm.so.6 as the 11th entry after libmkl_intel_lp64.so, libmkl_intel_thread.so, libmkl_core.so, and libiomp5.so, which are the 7th to 10th entries, respectively. Does the order in which libraries are listed matter? |
You would want to look at whether the shared libraries each executable is linking to at runtime are the same, using something like ldd. And that is also why we avoid shared libraries. |
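For example, something along these lines (the two executable names are placeholders for the hpc-stack and NCO builds):

```sh
# Record which shared objects each executable resolves at run time.
ldd ./global_gsi.x.hpcstack | awk 'NF >= 3 {print $1, $3}' | sort > ldd_hpcstack.txt
ldd ./global_gsi.x.nco      | awk 'NF >= 3 {print $1, $3}' | sort > ldd_nco.txt

# Any differing line is a library resolved differently between the two builds.
diff ldd_hpcstack.txt ldd_nco.txt
```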
Apply Mark Potts' suggestion in reverse to the WCOSS_D build of NOAA-EMC/GSI master.
Thus, two ways to get identical results
Interestingly on Hera, master build without Going back to WCOSS_D, diff of netcdf/4.7.4 nc-config between NCO and hpc/1.1.0 returns
Why does NCO build netcdf/4.7.4 this way? I'm not arguing that hpc/1.1.0 is wrong. I'm trying to understand why NCO netcdf/4.7.4 is built as it is. Is NCO's build correct? |
The Hera GSI build specifies Interestingly, the Hera and WCOSS_D EnKF builds both specify |
The above statement is WRONG. I was not testing the configurations I thought I was testing. Mike merged NOAA-EMC/GSI PR#107 into the NOAA-EMC/GSI master at ade4c6b. This merge brought hpc/1.1.0 into the modulefile used for the Hera build. It also updated cmake files used for builds on all platforms. Given this the following was done to build NOAA-EMC/GSI master on Venus using hpc/1.1.0:
The above was repeated for NOAA-EMC/GSI master using NCO's production v16 modules
Both executables were used to run the 2021020112 gdas case on Venus. The resulting global_gsi.x analyses differ. The differences show up in the initial total penalty terms. Of the printed 17 digits, the first 14 are identical between the two runs. The last 3 digits differ. By the start of the second outer loop 8 of 17 digits for the total penalties are identical. As a result, the final analyses are not b4b identical. The above builds were modified by replacing It was noted that when building global_gsi.x with hpc/1.1.0 The following combination of builds create global_gsi.x executables that generate identical analyses:
Another way to get identical global_gsi.x analyses is to build as follows:
Why does building with hpc/1.1.0 add -lm to the link step? The NCO v16 netcdf and hdf5 modules are NetCDF-parallel/4.7.4 and HDF5-parallel/1.10.6. These were loaded and nc-config and nf-config were executed with the option Ignoring the path differences, nf-config shows the following differences NCO hpc/1.1.0 hpc/1.1.0 cflags lists netcdf and hdf5. NCO cflags repeats netcdf twice. Why the repetition? Ignoring the path differences, nc-config shows more differences NCO
hpc/1.1.0
I do not know what all these settings mean. I also don't know which set of configuration settings is correct. Which installation of netcdf/4.7.4 and hdf5/1.10.6 is correct? I assume hpc/1.1.0 is correct, but the to-be-implemented GFS v16 package uses NCO's installation. |
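For the record, the two configurations can be captured side by side like this, assuming the same prerequisite modules shown in the issue description below are already loaded (output file names are illustrative; --all dumps the full set of build settings):

```sh
# With the NCO modules loaded:
module load NetCDF-parallel/4.7.4 HDF5-parallel/1.10.6
nc-config --all > ncconfig_nco.txt
nf-config --all > nfconfig_nco.txt

# In a fresh shell (or after module purge), with the hpc-stack modules loaded:
module load netcdf/4.7.4 hdf5/1.10.6
nc-config --all > ncconfig_hpc.txt
nf-config --all > nfconfig_hpc.txt

diff ncconfig_nco.txt ncconfig_hpc.txt
diff nfconfig_nco.txt nfconfig_hpc.txt
```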
I don't know why they would use g++ as their C++ compiler. CXX might not have been set and it picks up g++ as the default? Does GSI use NetCDF C++? I'm looking closely at the C++ flags. I can tell where each part is coming from, but I'm not sure why flags are being repeated. Something might be off there. However, that doesn't ultimately explain why the linking order makes for different results. Even with the repeats it looks like the correct libraries would be picked up. NCO repeats the flags in CFLAGS because they install HDF5 and NetCDF into the same directory. It's pointing to both, just like hpc-stack, but they're in the same place. And I'm thinking a different math library gets picked up somehow. Even without the |
Thanks, Kyle, for the reply. I'm not sure if the following answers your question, "Does GSI use NetCDF C++?". I did not add the netCDF interface to the GSI. Nor did I add the cmake build system, but here is what I see & did: src/gsi/CMakeLists.txt includes both ${NETCDF_Fortran_LIBRARIES} and ${NETCDF_C_LIBRARIES} as target_link_libraries. I removed ${NETCDF_C_LIBRARIES} and the build failed. Have any other developers reported different results when building their apps with NCO/IBM netcdf -vs- hpc/1.1.0 netcdf on WCOSS_D? |
@kgerheiser @RussTreadon-NOAA |
Rahul suggested some changes which I'm incrementally making in a working copy of the NOAA-EMC/GSI master. Now global_gsi.x built from NCO/IBM modules and hpc/1.1.0 modules yield identical initial penalties. Results differ in the minimization. |
I tinkered with src/gsi/CMakeLists.txt in a working copy of the NOAA-EMC/GSI master. As noted before adding Ran a series of tests in which
I need to figure out why the libraries are listed twice in How does |
One final test. Repeat the above exercise with global_gsi.x built using hpc/1.1.0 modules. With hpc/1.1.0
Build hpc/1.1.0 global_gsi.x with Move Tests indicate that |
Probably what is happening is that one of the CORE_LIBRARIES also defines a math function that is in lm. So the order of libraries will determine which function gets used. If the functions are slightly different, this could explain the difference in output. |
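One way to test that hypothesis is to check which of the resolved shared objects export a given libm symbol; broadly, the earlier library in the link/load order supplies the definition that gets used. A sketch (the executable name is a placeholder, and pow is just an example symbol):

```sh
# For every shared object the executable resolves, report whether it defines 'pow'.
for lib in $(ldd ./global_gsi.x | awk '$3 ~ /^\// {print $3}'); do
  if nm -D --defined-only "$lib" 2>/dev/null | grep -qw pow; then
    echo "pow defined in: $lib"
  fi
done
```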
Thank you, Ed, for your reply. I agree, though I do not know which library in ${CORE_LIBRARIES} also defines the math function. It would be useful to note in the hpc-stack wiki or a readme file that the hpc/1.1.0 netcdf module adds -lm to the link step, whereas the NCO/IBM GFS v16 module does not. I think this issue may be closed. |
Describe the bug
While running the GSI on Mars and Venus, the results from an executable built using hpc-stack modules don't reproduce an executable using NCO's modules. However, if the hpc-stack built executable uses NCO's NetCDF4 and HDF5 modules, then the results reproduce those from the NCO only executable.
To Reproduce
1) Clone NOAA-EMC:master as the control and then clone MichaelLueken-NOAA:master for the experiment.
2) Under the MichaelLueken-NOAA:master clone, update modulefiles/modulefile.ProdGSI.wcoss_d to use hpc-stack:
module load lsf/10.1
module use /usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-ips/18.0.1.163
module load hpc-impi/18.0.1
module load prod_util/1.2.2
module load bufr/11.4.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load bacio/2.4.1
module load crtm/2.3.0
module load cmake/3.10.0
module load python/3.6.3
module load netcdf/4.7.4
module load hdf5/1.10.6
The results from this test won't reproduce NOAA-EMC:master.
If you use:
module load lsf/10.1
module use /usrx/local/nceplibs/dev/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-ips/18.0.1.163
module load hpc-impi/18.0.1
module load prod_util/1.2.2
module load bufr/11.4.0
module load ip/3.3.3
module load nemsio/2.5.2
module load sfcio/1.4.1
module load sigio/2.3.2
module load sp/2.3.3
module load w3nco/2.4.1
module load w3emc/2.7.3
module load bacio/2.4.1
module load crtm/2.3.0
module load cmake/3.10.0
module load python/3.6.3
module load NetCDF-parallel/4.7.4
module load HDF5-parallel/1.10.6
This will reproduce NOAA-EMC:master.
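A way to check whether two runs actually reproduce each other bit for bit (file names are illustrative; nccmp is a third-party NetCDF comparison tool that may or may not be installed on WCOSS):

```sh
# Bitwise comparison of a binary analysis file from the two builds.
cmp run_hpcstack/siganl run_nco/siganl && echo "identical"

# For NetCDF diagnostic files, nccmp compares data (-d) and metadata (-m).
nccmp -d -m run_hpcstack/diag_conv_ges.nc4 run_nco/diag_conv_ges.nc4
```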
Expected behavior
Executables built using NCO modules should reproduce those built using hpc-stack. Is the problem with hpc-stack, or is there an issue with the current NCO built NetCDF-parallel and HDF5-parallel modules?
System:
This is an issue on WCOSS Dell machines. On Hera, NCEPLIBS built modules reproduce hpc/1.0.0-beta1 modules, which also reproduce hpc/1.1.0 modules.