floating overflow with T62_oQU240.GMPAS-IAF on cori-knl w debug intel #1309

Closed
ndkeen opened this issue Mar 15, 2017 · 20 comments

@ndkeen
Contributor

ndkeen commented Mar 15, 2017

The create_newcase command is:

create_newcase -case /global/cscratch1/sd/ndk/acme_scratch/SMS.T62_oQU240.GMPAS-IAF.cori-knl_intel.m27n01t02debugstats -res T62_oQU240 -mach cori-knl -compiler intel -compset GMPAS-IAF -project acme --walltime=00:30:00

The run completes 5 days in optimized builds (without DEBUG=TRUE) at about 16 SYPD.

For the following, I placed all components on the same node with 64 MPI ranks and adjusted the PIO stride. This is with 2 threads.

26: forrtl: error (72): floating overflow
26: Image              PC                Routine            Line        Source
26: acme.exe           00000000072D3171  Unknown               Unknown  Unknown
26: acme.exe           00000000072D12AB  Unknown               Unknown  Unknown
26: acme.exe           000000000727FBE4  Unknown               Unknown  Unknown
26: acme.exe (deleted  000000000727F9F6  Unknown               Unknown  Unknown
26: acme.exe           00000000071FF9C9  Unknown               Unknown  Unknown
26: acme.exe           000000000720BFBC  Unknown               Unknown  Unknown
26: acme.exe           0000000006BBF0E0  Unknown               Unknown  Unknown
26: acme.exe           00000000068C84ED  m_matattrvectmul_         267  m_MatAttrVectMul.F90
26: acme.exe           00000000068CFC80  m_matattrvectmul_         600  m_MatAttrVectMul.F90
26: acme.exe           00000000007ABBDF  seq_map_mod_mp_se         883  seq_map_mod.F90
26: acme.exe           00000000007975F3  Unknown               Unknown  Unknown
26: acme.exe           000000000055F298  Unknown               Unknown  Unknown
26: acme.exe           000000000042A107  cesm_comp_mod_mp_        1970  cesm_comp_mod.F90
26: acme.exe           000000000044408E  MAIN__                     62  cesm_driver.F90
26: acme.exe           000000000040B41E  Unknown               Unknown  Unknown
26: acme.exe (deleted  00000000072EFEA0  Unknown               Unknown  Unknown
26: acme.exe           000000000040B307  Unknown               Unknown  Unknown

I ran the same thing again with 1 thread and got a different failure:

46: forrtl: error (65): floating invalid
46: Image              PC                Routine            Line        Source
46: acme.exe           000000000702FA91  Unknown               Unknown  Unknown
46: acme.exe (deleted  000000000702DBCB  Unknown               Unknown  Unknown
46: acme.exe           0000000006FDC544  Unknown               Unknown  Unknown
46: acme.exe (deleted  0000000006FDC356  Unknown               Unknown  Unknown
46: acme.exe (deleted  0000000006F5F5C6  Unknown               Unknown  Unknown
46: acme.exe (deleted  0000000006F6B2BC  Unknown               Unknown  Unknown
46: acme.exe (deleted  0000000006C170B0  Unknown               Unknown  Unknown
46: acme.exe           00000000037469B7  ocn_global_stats_        1328  mpas_ocn_global_stats.f90
46: acme.exe           00000000035E7E96  ocn_analysis_driv        1163  mpas_ocn_analysis_driver.f90
46: acme.exe           00000000035DE652  ocn_analysis_driv         603  mpas_ocn_analysis_driver.f90
46: acme.exe           000000000355A790  ocn_comp_mct_mp_o         776  ocn_comp_mct.f90
46: acme.exe           000000000045B0B5  component_mod_mp_         676  component_mod.F90
46: acme.exe           000000000043958B  cesm_comp_mod_mp_        3233  cesm_comp_mod.F90
46: acme.exe           0000000000443FDB  MAIN__                     67  cesm_driver.F90
46: acme.exe           000000000040B41E  Unknown               Unknown  Unknown
46: acme.exe           000000000704C7E0  Unknown               Unknown  Unknown
46: acme.exe (deleted  000000000040B307  Unknown               Unknown  Unknown
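
When reading tracebacks like the two above, it can help to keep only the frames that resolved to a source line. The following is a minimal Python sketch of such a filter, not part of ACME or CIME; the names `FRAME` and `resolved_frames` are hypothetical, and the sample rows are copied from the second traceback.

```python
import re

# Matches one forrtl traceback row: "<rank>: <image> <PC> <routine> <line> <source>".
# The image column may read "acme.exe (deleted", as it does in the logs above.
FRAME = re.compile(
    r"^\s*(?P<rank>\d+):\s+(?P<image>\S+(?: \(deleted)?)\s+"
    r"(?P<pc>[0-9A-Fa-f]{16})\s+(?P<routine>\S+)\s+(?P<line>\S+)\s+(?P<source>\S+)\s*$"
)

def resolved_frames(traceback_text):
    """Return (routine, line, source) for frames whose source line is known."""
    frames = []
    for row in traceback_text.splitlines():
        m = FRAME.match(row)
        if m and m.group("line") != "Unknown":
            frames.append((m.group("routine"), int(m.group("line")), m.group("source")))
    return frames

sample = """\
46: forrtl: error (65): floating invalid
46: Image              PC                Routine            Line        Source
46: acme.exe (deleted  0000000006C170B0  Unknown               Unknown  Unknown
46: acme.exe           00000000037469B7  ocn_global_stats_        1328  mpas_ocn_global_stats.f90
46: acme.exe           00000000035E7E96  ocn_analysis_driv        1163  mpas_ocn_analysis_driver.f90
"""

for frame in resolved_frames(sample):
    print(frame)
```

The first tuple returned is the innermost resolved frame, which for this sample is line 1328 of mpas_ocn_global_stats.f90.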
@jonbob
Contributor

jonbob commented Mar 15, 2017

Did they fail or time out?

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

Looks like they aborted. I don't see any mention of them hitting time limit.

@jonbob
Contributor

jonbob commented Mar 15, 2017

can you point me at the case and run directories?

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

/global/cscratch1/sd/ndk/acme_scratch/SMS.T62_oQU240.GMPAS-IAF.cori-knl_intel.m27n01t01debugstats

and

/global/cscratch1/sd/ndk/acme_scratch/SMS.T62_oQU240.GMPAS-IAF.cori-knl_intel.m27n01t02debugstats

@jonbob
Contributor

jonbob commented Mar 15, 2017

What codebase are you using? Something isn't adding up...

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

Master from a few weeks ago. Should I update?

@jonbob
Contributor

jonbob commented Mar 15, 2017

maybe so -- it looks like your codebase is inconsistent with the scripts that build the mpas namelists. The ocn logs have errors that look like:
Error: Config config_AM_regionalStats_enable not found in pool.
but the namelist has regionalStats set up as daily/monthly instead. Is it possible you didn't update the submodules after updating? If not, you may just want to grab the newest master -- except that will bring you headlong into the CIME52 world.

@rljacob
Member

rljacob commented Mar 15, 2017

You could update to tag v1.0.0-beta.1 which is right before the cime5.2 update.

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

I always update submodules. Note that this is working with 60to30, for what it's worth.
Did something change that might affect this test recently?
Looks like this is a master from Feb 13th

@jonbob
Contributor

jonbob commented Mar 15, 2017

I don't know -- it just confuses me that you're getting these error messages and I don't see any reference to that config flag anywhere in your codebase... So maybe try updating to the hash that Rob suggested and see if we still get this?

@vanroekel
Contributor

I just encountered a similar error to what @ndkeen has seen here on edison in EC60to30v3 with cime5.2, but it doesn't seem to be crashing my run.

@jonbob
Contributor

jonbob commented Mar 15, 2017

The error with the config flag?

@vanroekel
Contributor

yes. Here is the tail of ocn.log

MPAS I/O: Truncating existing data in output file mpaso.hist.0001-01-01_00000.nc
 Error: Sub-pool surfaceSalinityMonthlyForcing_forcing_input not found in pool.
 ... Updating 0d char field xtime in stream
 ... found 0d char named xtime
 ... done updating field
 ... Updating 1d real field surfaceSalinityMonthlyClimatologyValue in stream
 ... found 1d real named surfaceSalinityMonthlyClimatologyValue
 ... done updating field
 Doing timestep 0001-01-01_00:45:00
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config config_AM_regionalStats_enable not found in pool.
 Completed timestep 0001-01-01_00:45:00
 Doing timestep 0001-01-01_01:00:00
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config config_AM_regionalStats_enable not found in pool.
 Completed timestep 0001-01-01_01:00:00
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config surfaceSalinityMonthlyClimatologyValue not found in pool.
 ERROR: Requested field surfaceSalinityMonthlyClimatologyValue not in stream sur
 face_salinity_monthly_data
 ERROR: -- Forcing: setup_input_fields: MPAS_stream_mgr_remove_field:
          -3
 Forcing: Error: MPAS_stream_mgr_remove_field
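
To see at a glance which config flags and sub-pools a log reports as missing, the repeated messages can be tallied. This is a small Python sketch under the assumption that the messages keep the exact wording shown in the log tail above; `MISSING` and `missing_pool_entries` are hypothetical names, not MPAS or CIME tools.

```python
import re
from collections import Counter

# Matches MPAS pool-lookup messages such as
# "Error: Config config_AM_regionalStats_enable not found in pool."
MISSING = re.compile(r"Error: (Config|Sub-pool) (\S+) not found in pool\.")

def missing_pool_entries(log_text):
    """Count each (kind, name) pair reported as missing from a pool."""
    return Counter(MISSING.findall(log_text))

sample_log = """\
 Doing timestep 0001-01-01_00:45:00
 Error: Config config_AM_regionalStats_enable not found in pool.
 Error: Config config_AM_regionalStats_enable not found in pool.
 Completed timestep 0001-01-01_00:45:00
 Error: Sub-pool surfaceSalinityMonthlyForcing_forcing_input not found in pool.
"""

for (kind, name), count in missing_pool_entries(sample_log).most_common():
    print(f"{count:4d}  {kind:8s} {name}")
```

A tally like this separates a message that repeats every timestep from one that appears once near a failure.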

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

I had a more recent master available (March 9th) and fired off another 1-node test. I get a slightly different error.

49:  ----- done parsing run-time I/O from streams.cice -----
49: 
49: forrtl: severe (194): Run-Time Check Failure. The variable 'cice_analysis_driver_mp_cice_precompute_analysis_members_$ERR_TMP' is being used in 'mpas_cice_analysis_driver.f90(1278,7)' without being defined
49: Image              PC                Routine            Line        Source             
49: acme.exe           000000000299BAC5  Unknown               Unknown  Unknown
49: acme.exe (deleted  00000000029912C6  Unknown               Unknown  Unknown
49: acme.exe           0000000000834B01  ice_comp_mct_mp_i         790  ice_comp_mct.f90
49: acme.exe           000000000045B0B5  component_mod_mp_         676  component_mod.F90
49: acme.exe           0000000000431137  cesm_comp_mod_mp_        2552  cesm_comp_mod.F90
49: acme.exe           0000000000443FDB  Unknown               Unknown  Unknown
49: acme.exe (deleted  000000000040B41E  Unknown               Unknown  Unknown
49: acme.exe           00000000070642E0  Unknown               Unknown  Unknown
49: acme.exe           000000000040B307  Unknown               Unknown  Unknown

@ndkeen
Contributor Author

ndkeen commented Mar 15, 2017

And just to clarify, if I add:
config_am_globalstats_enable = .false.
to user_nl_mpaso, the Feb 13th master runs without a problem.

I also tried turning off global stats for the March 9th master, and I still get the same error as above. So I presume these are two different errors.

@vanroekel
Contributor

@jonbob and @ndkeen the regional stats error seen in these runs is a bug in the analysis driver that has been fixed in ocean/develop, but has not yet propagated to ACME. It does not crash the run though. It is just a printed statement.

@mark-petersen
Contributor

@ndkeen, on the second error of your first post, using one node and one thread, you have an error:

46: forrtl: error (65): floating invalid
46: acme.exe           00000000037469B7  ocn_global_stats_        1328  mpas_ocn_global_stats.f90

This is here:

      if (totalVolumeChange == 0.0_RKIND) then
         relativeFreshWaterConservation = 0.0_RKIND
      else
         relativeFreshWaterConservation = (totalVolumeChange - netFreshwaterInput)/totalVolumeChange
      endif

I bet the problem on that one is that totalVolumeChange is very small but not exactly zero. Is that setup still available to you? If so, retest with this change:

line 1325 of components/mpas-o/model/src/core_ocean/analysis_members/mpas_ocn_global_stats.F
      if (abs(totalVolumeChange) < 1e-12_RKIND) then

That would be a safe change to make regardless, and I can do that.
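
The difference between the two tests can be illustrated outside the model. The sketch below is plain Python, not the MPAS source: the function names are hypothetical, the tolerance of 1e-12 is the one proposed above, and the tiny input value is chosen only for illustration. Python returns an infinite quotient silently, whereas the debug build in this issue aborts on floating-point exceptions.

```python
import math

def exact_zero_version(total_volume_change, net_freshwater_input):
    """Mirror of the original test: only an exactly zero denominator is special-cased."""
    if total_volume_change == 0.0:
        return 0.0
    return (total_volume_change - net_freshwater_input) / total_volume_change

def relative_freshwater_conservation(total_volume_change, net_freshwater_input, tol=1e-12):
    """Mirror of the proposed test: any denominator smaller than tol is treated as zero."""
    if abs(total_volume_change) < tol:
        return 0.0
    return (total_volume_change - net_freshwater_input) / total_volume_change

tiny = 1e-320  # nonzero, so it slips past the exact-zero comparison

# The quotient is too large for a double and comes back infinite.
print(math.isinf(exact_zero_version(tiny, 1.0)))
# The tolerance guard short-circuits before dividing.
print(relative_freshwater_conservation(tiny, 1.0))
# Ordinary denominators are unaffected by the guard.
print(relative_freshwater_conservation(2.0, 1.0))
```

The trade-off is that any genuine volume change smaller than the tolerance is reported as perfectly conserved, so the threshold should sit well below physically meaningful values.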

@ndkeen
Contributor Author

ndkeen commented Jul 27, 2017

Mark: just looking through old issues. I guess we are just waiting for your change to propagate to ACME?

@jonbob
Contributor

jonbob commented Aug 1, 2017

@ndkeen - the changes should have propagated to ACME in May. Do you want to retest? Or close?

@mark-petersen
Contributor

Yes, the change is in ACME, and takes care of the floating invalid in mpas_ocn_global_stats.f90 in the description at the top. Closing this issue.
