Radiation diagnostics out of memory crash #2575

Closed
polunma opened this issue Oct 11, 2018 · 15 comments · Fixed by #3932
@polunma
Contributor

polunma commented Oct 11, 2018

I have spent more than two weeks trying to figure out a model crash. I was finally able to determine that the crash can be reproduced with current master without any code modification: I checked out the latest master, made no changes to the code, and enabled 10 radiation diagnostics for a run. The model ran for one month, wrote out h0 files, ran a few more days, and then crashed with an error message stating "out of memory". Could somebody please help?

@polunma
Contributor Author

polunma commented Oct 11, 2018

I forgot to mention that if I restart every month, the model can continue to run.

@rljacob
Member

rljacob commented Oct 11, 2018

Can you give some more info on how to reproduce? What is the create_newcase command? How does one enable 10 radiation diagnostics?

@ndkeen
Contributor

ndkeen commented Oct 11, 2018

Yes, please give more information: ideally a create_test command (on a specific machine) and an explanation of what is different from the default. If I can reproduce it, we are much more likely to make progress. Otherwise I can only guess: if it is running out of memory, things we typically try are running on more nodes, running with fewer MPI ranks per node, or running without threads. The fact that the model continues for another month after a restart is interesting; it implies a memory leak that might only show up after running long enough.
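For reference, the run-configuration changes Noel suggests are usually made with CIME's `xmlchange` tool in the case directory. A sketch (the variable values below are placeholders, not recommendations for this case):

```shell
# In the case directory: turn threading off and change the task count
# so the job spreads over more nodes. NTASKS and NTHRDS are standard
# CIME XML variables; the value 1350 is purely illustrative.
./xmlchange NTHRDS=1
./xmlchange NTASKS=1350
./case.setup --reset
./case.build
./case.submit
```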

@polunma
Contributor Author

polunma commented Oct 11, 2018

Thanks Rob and Noel so much for your help! Balwinder also suspected a memory leak. (He suggested trying to see if the model can continue to run with month-to-month restarts.) Here are some more details if you want to reproduce the crash:

  1. The runs are done on cori-knl. Default settings.
  2. ./create_newcase -case $CASEROOT -mach $MACH -res ne30_ne30 -compset FC5AV1C-04P2 -compiler intel
  3. To do 10 radiation diagnostics, modify the atmosphere model namelist:
    cat << EOF >> user_nl_cam
    &camexp
    rad_diag_1 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_2 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_3 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_4 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_5 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_6 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_7 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_8 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_9 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
    'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
    rad_diag_10 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
    'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
    'N:CFC11:CFC11', 'N:CFC12:CFC12'
    fincl1 ='FSNTC_d1','FLNTC_d1','FSNSC_d1','FLNSC_d1','FSNT_d1','FLNT_d1','FSNS_d1','FLNS_d1','QRS_d1','QRL_d1','SWCF_d1','LWCF_d1','FSNTC_d2','FLNTC_d2','FSNSC_d2','FLNSC_d2','FSNT_d2','FLNT_d2','FSNS_d2','FLNS_d2','QRS_d2','QRL_d2','SWCF_d2','LWCF_d2','FSNTC_d3','FLNTC_d3','FSNSC_d3','FLNSC_d3','FSNT_d3','FLNT_d3','FSNS_d3','FLNS_d3','QRS_d3','QRL_d3','SWCF_d3','LWCF_d3','FSNTC_d4','FLNTC_d4','FSNSC_d4','FLNSC_d4','FSNT_d4','FLNT_d4','FSNS_d4','FLNS_d4','QRS_d4','QRL_d4','SWCF_d4','LWCF_d4','FSNTC_d5','FLNTC_d5','FSNSC_d5','FLNSC_d5','FSNT_d5','FLNT_d5','FSNS_d5','FLNS_d5','QRS_d5','QRL_d5','SWCF_d5','LWCF_d5','FSNTC_d6','FLNTC_d6','FSNSC_d6','FLNSC_d6','FSNT_d6','FLNT_d6','FSNS_d6','FLNS_d6','QRS_d6','QRL_d6','SWCF_d6','LWCF_d6','FSNTC_d7','FLNTC_d7','FSNSC_d7','FLNSC_d7','FSNT_d7','FLNT_d7','FSNS_d7','FLNS_d7','QRS_d7','QRL_d7','SWCF_d7','LWCF_d7','FSNTC_d8','FLNTC_d8','FSNSC_d8','FLNSC_d8','FSNT_d8','FLNT_d8','FSNS_d8','FLNS_d8','QRS_d8','QRL_d8','SWCF_d8','LWCF_d8','FSNTC_d9','FLNTC_d9','FSNSC_d9','FLNSC_d9','FSNT_d9','FLNT_d9','FSNS_d9','FLNS_d9','QRS_d9','QRL_d9','SWCF_d9','LWCF_d9','FSNTC_d10','FLNTC_d10','FSNSC_d10','FLNSC_d10','FSNT_d10','FLNT_d10','FSNS_d10','FLNS_d10','QRS_d10','QRL_d10','SWCF_d10','LWCF_d10'
    /
    EOF

@singhbalwinder
Contributor

singhbalwinder commented Oct 11, 2018

@polunma : Do you think omitting fincl1 output will still result in a crash?

For those who are not familiar with the scripts we use to run the model: build the model using the create_newcase command @polunma mentioned, then add the namelist entries between the heredoc markers (the rad_diag_1 through rad_diag_10 lists and the fincl1 line) to the user_nl_cam file in the case directory.

Please note that the path to the input data directory is hardwired here (/project/projectdirs/acme/inputdata/), so you would have to change it if you run on any machine other than the NERSC machines.

@yfenganl
Contributor

yfenganl commented Oct 11, 2018 via email

@ndkeen
Contributor

ndkeen commented Oct 11, 2018

I ran an ne30 F case on cori-knl using only 3 nodes, 67 MPI ranks per node, and 4 threads each (using a repo from Aug 21st that has some additional profiling mods). I used the above user_nl_cam (thanks for the clarification, Balwinder) and ran for 2 days. Sure enough, the memory use increases steadily over time. (Note: I updated the plot below after re-running for 2 complete days.)

Does it make sense to try reducing the number of entries in user_nl_cam to see if a specific one causes the issue?

[Figure: per-node RSS over time, job 15663587]

By contrast, here is the same plot for a run made without those radiation entries in user_nl_cam (running for 2 days):

[Figure: per-node RSS over time without the radiation entries, job 15479787]

@ndkeen
Contributor

ndkeen commented Oct 11, 2018

Using some even more experimental tools I've been working on, I see that the memory increases more in CAM_run2 than in CAM_run1: every other call to CAM_run2 adds about 10 MB, while every call to CAM_run1 increases the peak memory by about 1 MB.

I can show a plot, but it's pretty messy.

This plot shows the peak RSS (i.e., it can only increase) over time for each rank, measured at certain places in the code. The observations above come from the raw data, but it's still good to see the plot. Conveniently, rank 0 simply uses more memory overall, so the blue dots (rank 0) stand out. This differs from the plots above in that the memory data comes not from top but from a call within the code.

[Figure: peak RSS per rank, measured via timer calls, job 15663587]

The image is largish, as I created it with a higher DPI to allow better zooming in. You might need to download the file first to zoom effectively.
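For readers who want to reproduce this kind of in-code measurement, here is a minimal sketch (in Python, for illustration only; the actual instrumentation here lives in the Fortran source) that reads the current and peak RSS from /proc/self/status on Linux:

```python
def read_rss_kb():
    """Return (current RSS, peak RSS) in kB for this process.

    Parses the VmRSS (current resident set size) and VmHWM (peak RSS,
    the "high-water mark") fields of /proc/self/status. Linux-only.
    """
    vals = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmRSS:", "VmHWM:")):
                # Lines look like "VmRSS:     123456 kB"
                key, value = line.split(":", 1)
                vals[key] = int(value.split()[0])
    return vals["VmRSS"], vals["VmHWM"]

rss, peak = read_rss_kb()
print(rss, peak)
```

Logging these two numbers per rank at the entry and exit of each timed region gives data like the plot above.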

@singhbalwinder
Contributor

Thanks @ndkeen! Those are really clear visualizations. As far as I remember, CAM_run1 calls radiation (tphysbc), which calls these radiation diagnostics. So this confirms what we already suspected: the diagnostics are causing this memory leak.

Thanks @yfenganl for reporting on the ne120 grid. It might be faster to reproduce this using ne120, as it may already be using a lot of memory.

@ndkeen
Contributor

ndkeen commented Oct 11, 2018

FWIW, if I remove the fincl1 line in the user_nl_cam, I still see the same memory behavior.

I also ran without the fincl1 line and with only the first rad_diag_1 entry in user_nl_cam. The memory use is substantially lower, but it's not clear whether there is still growth. I can look more closely if this is a worthwhile avenue.

Also, I ran with DEBUG=TRUE. The job ran out of time, but after many steps there were no errors.

@polunma
Contributor Author

polunma commented Oct 16, 2018

Thank you all very much for taking a look! Is there any hope of identifying/fixing the bug soon? BTW @ndkeen, the fincl1 line is essential, because otherwise the results from the radiation diagnostics are not written out.
One of my month-to-month runs failed (error message below). The weirdest thing is that I simply resubmitted the same job and it completed successfully...
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "CPL:RUN_LOOP" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
0: newchild: child "a:radiation" can't be a parent of itself
525: forrtl: severe (154): array index out of bounds
525: Image PC Routine Line Source
525: e3sm.exe 0000000004B1521E Unknown Unknown Unknown
525: e3sm.exe 00000000043B7F40 Unknown Unknown Unknown
525: e3sm.exe 0000000001874E22 aero_model_mp_mod 3009 aero_model.F90
525: e3sm.exe 00000000018748F8 aero_model_mp_aer 1642 aero_model.F90
525: e3sm.exe 00000000005FC982 physpkg_mp_tphysb 2621 physpkg.F90
525: e3sm.exe 00000000005F5709 physpkg_mp_phys_r 1029 physpkg.F90
525: e3sm.exe 0000000004013103 Unknown Unknown Unknown
525: e3sm.exe 0000000003FCA9E0 Unknown Unknown Unknown
525: e3sm.exe 0000000003FCBFD4 Unknown Unknown Unknown
525: e3sm.exe 0000000003F9D8F4 Unknown Unknown Unknown
525: e3sm.exe 00000000005F524B physpkg_mp_phys_r 1018 physpkg.F90
525: e3sm.exe 00000000004EEEF7 cam_comp_mp_cam_r 250 cam_comp.F90
525: e3sm.exe 00000000004DF4E1 atm_comp_mct_mp_a 522 atm_comp_mct.F90
525: e3sm.exe 00000000004285F4 component_mod_mp_ 728 component_mod.F90
525: e3sm.exe 000000000040E976 cime_comp_mod_mp_ 3370 cime_comp_mod.F90
525: e3sm.exe 00000000004282F3 MAIN__ 103 cime_driver.F90
525: e3sm.exe 000000000040A80E Unknown Unknown Unknown
525: e3sm.exe 0000000004C382D9 Unknown Unknown Unknown
525: e3sm.exe 000000000040A6F9 Unknown Unknown Unknown
srun: error: nid02921: task 525: Exited with exit code 154
srun: Terminating job step 15659352.0
495: slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Resource temporarily unavailable
0: slurmstepd: error: *** STEP 15659352.0 ON nid02906 CANCELLED AT 2018-10-11T05:01:09 ***
505: forrtl: error (78): process killed (SIGTERM)

@ndkeen
Contributor

ndkeen commented Oct 25, 2018

I spent more time trying to debug this. I don't have a fix, but I do have some more information. I tried several things, including valgrind, which has not yet been useful (valgrind details below).

Using my own attempts to measure memory by placing calls within the code, I can narrow down where the memory (RSS) is growing and shrinking. I write out the current RSS as well as the peak RSS. Originally I was tracking the peak, but that did not lead anywhere, since the leak can be elsewhere: it raises the base memory use, while some other code might allocate/deallocate and set the actual high-water RSS.

So, looking at the current RSS and tracking when it increases without a corresponding decrease, I see that every call to CAM_run2 shows an increase in memory with no matching decrease. Within there, it looks like the increase happens in the code inside the "microp_tend" timer. However, the increase does NOT happen on every pass through microp_tend. There is certainly a pattern, but it's not obvious; I can write a script and make a plot to show the pattern if that would help.
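As a sketch of the kind of analysis such a script would do (a hypothetical helper in Python, not the actual tooling used here): given RSS samples recorded at successive passes through a code region, flag the passes where memory grew and was never given back.

```python
def leak_candidates(samples):
    """samples: RSS values (kB) recorded at successive passes through a
    code region. Return the indices of passes where RSS increased and
    never later fell back below the pre-increase level -- i.e. candidate
    leak points rather than transient allocate/deallocate churn."""
    out = []
    for i in range(1, len(samples)):
        grew = samples[i] > samples[i - 1]
        never_released = min(samples[i:]) > samples[i - 1]
        if grew and never_released:
            out.append(i)
    return out

# A transient 10 kB bump (released at pass 2) vs. persistent growth at pass 3:
print(leak_candidates([100, 110, 100, 110, 110]))  # -> [3]
```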

Within the microp_tend code, I think the increase happens in subroutine micro_mg_cam_tend(). Again, not on every call, but in a pattern. I have a little more detail from inside this routine, but it is quite large and complicated (well beyond the size good software engineering would suggest), and I was hoping someone more familiar with it could weigh in.

Running the same case without the radiation diagnostics shows no increase in memory as described above (in fact, the memory is well-behaved across the day).

The reason I wanted to try without the fincl1 line was not that I was suggesting it as a solution, but to help debug: if we remove that line and memory does not increase, it narrows down the issue. When I tried it (earlier on), the memory behavior seemed the same.
The same argument applies to trying fewer radiation diagnostics: is it possible that only one or a few of them cause the issue? I haven't tried this yet.

I also tried the same test with
use_hetfrz_classnuc = .false.
since this is something we've been using in the coupled runs, and the code does something different with this flag. I originally reported that this crashed, but I was mistaken; something else I was doing had crashed. Running it again shows no issues. I still need to verify that the memory behavior is similar, but there is no reason to suspect it will be any different.


Below are notes on using valgrind:
I tried valgrind with this ne30 problem and developed a valgrind suppressions file to limit the output. I noticed something odd, but it turned out to be an issue that can arise when using valgrind itself: when I re-compiled with -heap-arrays (for the Intel compiler), that issue went away and the run continued.
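For reference, a valgrind suppressions file consists of entries like the following (a generic sketch, not the actual file used here); the `...` frame is a wildcard matching any number of stack frames:

```
{
   ignore_reachable_from_mpi_init
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   ...
   fun:PMPI_Init
}
```

It is passed with something like `valgrind --leak-check=yes --suppressions=my.supp ./e3sm.exe` (these are standard valgrind options; the exact srun wrapping depends on the machine).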

Then the run crashes/hangs with errors like this:

118:  ccm kohlerc - no real(r8) solution found (quartic)
118:  roots = (NaN,NaN) (NaN,NaN) (NaN,NaN) (NaN,NaN)
118:  p0-p3 = -8.481208826751322E-009 -9.331906172134355E-005  0.000000000000000E+000
118:   9.360098364365669E-005
118:  rh=  2.497662239924677E-006
118:  setting radius to dry radius=  4.491510796752449E-002

Without valgrind, the run is fine, so I'm not sure what to make of this. I just haven't had time to figure it out or to fully process the valgrind output (nothing obvious there, though).
case:
/global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.intel.n004p0120t30x1.6s.rssp.heaparraysg.vg

I then tried with the GNU compiler; it completes 2 steps with output before timing out. However, the output does not contain the source code info (I did re-compile with -g, hoping it would be included):

  8: ==53511== Invalid read of size 16
  8: ==53511==    at 0x170E9D3: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16D6342: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16D70F7: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x16CF367: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)
  8: ==53511==    by 0x169768D: ??? (in /global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg/bld/e3sm.exe)

/global/cscratch1/sd/ndk/acme_scratch/cori-haswell/mprofaug21/f.ne30_ne30.mprofaug21.gnu.n004p0128t32x1.6s.rssp.rad10.heaparraysg.vg

I tried these tests on anvil as well as on cori-haswell, with both the Intel and GNU compilers. The versions of the compilers and of valgrind on cori are more recent.

@ndkeen
Contributor

ndkeen commented Sep 30, 2020

Noting that #3866 might address this memory issue. I will test as soon as @singhbalwinder says it's ready.

@ndkeen
Contributor

ndkeen commented Oct 1, 2020

When I try @singhbalwinder's branch in PR 3468 and use the same user_nl_cam as above (renamed to user_nl_eam now), the memory appears almost constant after 3 days, whereas before, even with a master as of Sept 24th, the memory was increasing by at least 160 MB every day. So I think that PR will fix this issue.

@singhbalwinder
Contributor

Thanks @ndkeen for testing it so quickly! I will make a note in my PR that it fixes this memory issue.

wlin7 added a commit that referenced this issue Nov 30, 2020
…1 into next (PR #3932)

Fixes and enables radiation diagnostics

Radiation diagnostic calls are enabled so that aerosol species mentioned in a
radiation diagnostics list (rad_diag_* in the atm_in file) can participate in
all the same processes as the prognostic radiation calls (listed in the
rad_climate list in atm_in). The processes missing from the diagnostic calls were:

Aerosol size adjustment
Aitken<->Accumulation aerosol transfer

Enabling these calculations ensures that a radiation diagnostic call with
exactly the same species as the radiation climate call produces BFB diagnostic
fields (issue #3468).

Since radiation diagnostic lists (rad_diag_*) can exclude species or even
an entire mode (or modes), I have relied on rad_cnst_* calls to get
information about mode numbers, the number of species in a mode, and mode
(or species) properties. The rad_cnst_* calls guarantee that only the
species/modes present in the rad_diag_* lists are accessed. I have tested
the following cases:

Excluding all aerosols
Excluding BC from all modes
Excluding SOA from all modes
Excluding Aitken mode
Excluding Accumulation modes
Radiation diagnostic list exactly the same as radiation climate list

For diagnostic lists, the "Aerosol size adjustment" process is always ON, but
"Aitken<->Accumulation transfer" is turned off for diagnostic calls where the
Aitken or Accumulation mode is absent. I have reworked the mapping so that
missing species in the Aitken and Accumulation modes are accounted for.

The modal_aero_calcsize_sub subroutine is heavily refactored: different
processes are split into their own routines (for readability), and similar
calculations are combined.

This code also fixes the memory leak issue mentioned in #2575. It also fixes
another memory leak recently introduced by PR #3885 (thanks to Andrew Bradley
for finding this!). This PR also cleans up the logic for the clear_rh
variables, following Ben Hillman's suggestions.

Fixes #2575
Fixes #3468
[BFB] (for prognostic radiation calls, the answers will change for the
       diagnostic calls as this PR fixes a bug identified in issue #3468)
wlin7 added a commit that referenced this issue Dec 1, 2020
…1 into next (PR #3932)

Fixes and enables radiation diagnostics

Radiation diagnostic calls are enabled where aerosol species mentioned in
radiation diagnostics list (rad_diag_* in atm_in file) can participate in
all the same processes as prognostic radiation calls (mentioned in
rad_climate list in atm_in). The missing processes for the diagnostic calls were:

Aerosol size adjustment
Aitken<->Accumulation aerosol transfer

Enabling these calculations ensure that radiation diagnostic call with
exactly the same species as radiation climate call, produces BFB diagnostic
fields (issue #3468).

Since, radiation diagnostic lists (rad_diag_*) can exclude species or even
an entire mode (or modes), I have relied on rad_cnst_* calls to get info
about mode numbers, number of species in a mode and mode (or species)
properties. rad_cnst_* calls guarantee that only species/modes present in
the rad_diag_* lists are accessed. I have tested the following cases:

Excluding all aerosols
Excluding BC from all modes
Excluding SOA from all modes
Excluding Aitken mode
Excluding Accumulation modes
Radiation diagnostic list exactly the same as radiation climate list

For diagnostic lists, "Aerosol size adjustment" process is always ON but
"Aiken<->Accumulation transfer" is turned off for diagnostic calls where
Aitken or Accumulation mode is absent. I have reworked the mapping so that missing
species in aitken and accumulation modes are accounted for.

The modal_aero_calcsize_sub subroutine is heavily refactored where different
processes are refactored into their own routines (for readability) and similar
calculations are combined together.

This code also fixes the memory leak issue mentioned in #2575. It also fixes
another memory leak recently introduced by PR #3885 (thanks for Andrew Bradley
for finding this!). This PR also cleans up logic for clear_rh variables
following Ben Hillman's suggestions.

Fixes #2575
Fixes #3468
[BFB] (for prognostic radiation calls, the answers will change for the
       diagnostic calls as this PR fixes a bug identified in issue #3468)
wlin7 added a commit that referenced this issue Dec 1, 2020
…1 into next (PR #3932)
wlin7 added a commit that referenced this issue Dec 1, 2020
…1 into next (PR #3932)
wlin7 added a commit that referenced this issue Feb 10, 2021
…1 into next (PR #3932)
@wlin7 wlin7 closed this as completed in 31f2a75 Feb 11, 2021
jgfouca pushed a commit that referenced this issue Jan 18, 2024
…gate_racecheck_fail

Automatically Merged using E3SM Pull Request AutoTester
PR Title: Add team_barriers to water path tests
PR Author: tcclevenger
PR LABELS: AT: AUTOMERGE, bugfix