Radiation diagnostics out of memory crash #2575
I forgot to mention that if I restart the run every month, the model can continue to run. |
Can you give some more info on how to reproduce? What is the create_newcase command? How does one enable 10 radiation diagnostics? |
Yes, please give more information. Ideally a create_test command (on a specific machine) and an explanation of what differs from the default. If I can repeat it, it's much more likely that progress will be made. Otherwise, I can only guess: if it is running out of memory, one thing we typically try is running with more nodes and/or fewer MPI ranks per node. Try running without threads? If it continues for a month after restart, that does sound interesting -- it implies there could be a memory leak which might only show up after running long enough. |
Thanks Rob and Noel so much for your help! Balwinder also suspected a memory leak. (He suggested trying to see if the model can continue to run with month-to-month restarts.) Here are some more details if you want to reproduce the crash:
|
@polunma : Do you think omitting fincl1 output will still result in a crash?
For those who are not familiar with the scripts that we use to run the model: build the model using the create_newcase command @polunma mentioned, and add the namelist text (quoted in full in the email below) to the user_nl_cam file in the case directory.
Please note that the path to the input data directory is hardwired (/project/projectdirs/acme/inputdata/), so you would have to change it if you run on any machine other than the NERSC machines. |
I ran into a similar issue when I was doing ne120 runs (FC5AVIC-H01A) on Anvil with only one radiation diagnostics call. I checked the memory usage in the log file as Az suggested, and found that the memory use was indeed accumulating as the run continued, until it exceeded the maximum memory per node. We didn't get a chance to track down the problem.
But restarting seemed to solve the problem. I was able to complete one-year runs by restarting before the run time got close to the crash point.
I was using a version checked out in April 2018; I haven't tested it with the new master.
-Yan Feng, Argonne National Laboratory
From: singhbalwinder, Thursday, October 11, 2018 — Re: [E3SM-Project/E3SM] Radiation diagnostics out of memory crash (#2575)
@polunma : Do you think omitting fincl1 output will still result in a crash?
For those who are not familiar with the scripts that we use to run the model: build the model using the create_newcase command @polunma mentioned, and add the text between
"cat <! user_nl_cam
&camexp"
and
"/
EOF"
to the user_nl_cam file in the case directory. That is, add the following to user_nl_cam:
rad_diag_1 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_2 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_3 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_4 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_5 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_6 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_7 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_8 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_9 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12', 'M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc',
'M:mam4_mode2:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode2_rrtmg_c130628.nc', 'M:mam4_mode3:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode3_rrtmg_c130628.nc', 'M:mam4_mode4:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode4_rrtmg_c130628.nc'
rad_diag_10 = 'A:Q:H2O', 'N:O2:O2', 'N:CO2:CO2',
'A:O3:O3', 'N:N2O:N2O', 'N:CH4:CH4',
'N:CFC11:CFC11', 'N:CFC12:CFC12'
fincl1 ='FSNTC_d1','FLNTC_d1','FSNSC_d1','FLNSC_d1','FSNT_d1','FLNT_d1','FSNS_d1','FLNS_d1','QRS_d1','QRL_d1','SWCF_d1','LWCF_d1','FSNTC_d2','FLNTC_d2','FSNSC_d2','FLNSC_d2','FSNT_d2','FLNT_d2','FSNS_d2','FLNS_d2','QRS_d2','QRL_d2','SWCF_d2','LWCF_d2','FSNTC_d3','FLNTC_d3','FSNSC_d3','FLNSC_d3','FSNT_d3','FLNT_d3','FSNS_d3','FLNS_d3','QRS_d3','QRL_d3','SWCF_d3','LWCF_d3','FSNTC_d4','FLNTC_d4','FSNSC_d4','FLNSC_d4','FSNT_d4','FLNT_d4','FSNS_d4','FLNS_d4','QRS_d4','QRL_d4','SWCF_d4','LWCF_d4','FSNTC_d5','FLNTC_d5','FSNSC_d5','FLNSC_d5','FSNT_d5','FLNT_d5','FSNS_d5','FLNS_d5','QRS_d5','QRL_d5','SWCF_d5','LWCF_d5','FSNTC_d6','FLNTC_d6','FSNSC_d6','FLNSC_d6','FSNT_d6','FLNT_d6','FSNS_d6','FLNS_d6','QRS_d6','QRL_d6','SWCF_d6','LWCF_d6','FSNTC_d7','FLNTC_d7','FSNSC_d7','FLNSC_d7','FSNT_d7','FLNT_d7','FSNS_d7','FLNS_d7','QRS_d7','QRL_d7','SWCF_d7','LWCF_d7','FSNTC_d8','FLNTC_d8','FSNSC_d8','FLNSC_d8','FSNT_d8','FLNT_d8','FSNS_d8','FLNS_d8','QRS_d8','QRL_d8','SWCF_d8','LWCF_d8','FSNTC_d9','FLNTC_d9','FSNSC_d9','FLNSC_d9','FSNT_d9','FLNT_d9','FSNS_d9','FLNS_d9','QRS_d9','QRL_d9','SWCF_d9','LWCF_d9','FSNTC_d10','FLNTC_d10','FSNSC_d10','FLNSC_d10','FSNT_d10','FLNT_d10','FSNS_d10','FLNS_d10','QRS_d10','QRL_d10','SWCF_d10','LWCF_d10'
Please note that path to input data directory is hardwired here (/project/projectdirs/acme/inputdata/) so you would have to change that if you run on any other machine except the NERSC machines.
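The rad_diag_* entries above follow a colon-separated triplet format: a source tag ('A', 'N', or 'M'), a constituent or mode name, and either a radiation name or a physprops file path. A minimal Python sketch of parsing such entries (the helper name is hypothetical, not part of E3SM):

```python
# Hypothetical helper (not E3SM code): split a rad_diag_* entry into its three
# colon-separated fields. Everything after the second colon is kept whole,
# since a physprops file path may contain no further delimiters.
def parse_rad_diag_entry(entry):
    source, name, rad = entry.split(":", 2)
    return {"source": source, "name": name, "rad": rad}

entries = [
    "A:Q:H2O",
    "N:CO2:CO2",
    "M:mam4_mode1:/project/projectdirs/acme/inputdata/atm/cam/physprops/mam4_mode1_rrtmg_c130628.nc",
]
parsed = [parse_rad_diag_entry(e) for e in entries]
# Modal aerosol entries ('M') carry a physprops file path in the third field.
modal_entries = [p for p in parsed if p["source"] == "M"]
```

This makes it easy to see why the namelist above is machine-specific: the 'M' entries embed absolute paths into the NERSC input-data tree.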
|
I ran a ne30 F case using only 3 nodes, 67 MPI ranks per node, 4 threads each on cori-knl (using a repo from Aug 21st that has some additional profiling mods). I used the above user_nl_cam (thanks for the clarification, Balwinder). I asked it to run for 2 days. Sure enough, the memory use is increasing steadily over time. (Note: I updated the plot below after re-running for 2 complete days.) Does it make sense to try reducing the number of entries in user_nl_cam to see if there is a specific one that causes the issue? By contrast, here is the same plot for a run made without those radiation entries in user_nl_cam (running for 2 days): |
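The kind of check described above -- watching RSS grow in the log as the run progresses -- can be sketched as follows. The sample values and the noise threshold are made up for illustration:

```python
# Illustrative only: given RSS samples (MB) taken at regular intervals from
# the run log, estimate the average growth per interval and flag a likely leak.
def leak_rate_mb(samples):
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    return sum(deltas) / len(deltas)

# Fabricated samples resembling the steady growth seen in the plots.
rss_mb = [1200, 1360, 1525, 1680, 1840]
rate = leak_rate_mb(rss_mb)
# 50 MB/interval is an assumed noise floor, not an E3SM convention.
suspect_leak = rate > 50
```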
Using some even more experimental tools I've been working on, I see that the memory increases more in CAM_run2 than in CAM_run1. Every other call to CAM_run2 adds about 10 MB, while every call to CAM_run1 increases the peak memory by about 1 MB. I can show a plot, but it's pretty messy. This is the peak RSS (i.e., it will only increase) over time for each rank, with measurements taken at certain places in the code. The notes above come from looking at the raw data, but it's still good to see the plot. Conveniently, rank 0 just uses more memory overall, so the blue dots (rank 0) stand out. This is different from the above plots -- the memory data is not coming from top, but from a call within the code. The image is largish, as I created it with higher DPI to allow for better zooming; you might need to download the file first to zoom in. |
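For reference, on Linux the current and peak RSS that such in-code measurements report correspond to the VmRSS and VmHWM ("high water mark") fields of /proc/self/status. A small sketch parsing that format (the sample text below is fabricated):

```python
# Fabricated excerpt of a /proc/<pid>/status file, for illustration.
SAMPLE_STATUS = """\
Name:   e3sm.exe
VmPeak:  2048000 kB
VmHWM:    768000 kB
VmRSS:    512000 kB
"""

# Extract current RSS (VmRSS) and peak RSS (VmHWM) in kB from status text.
def parse_rss(text):
    vals = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].rstrip(":") in ("VmRSS", "VmHWM"):
            vals[parts[0].rstrip(":")] = int(parts[1])
    return vals

mem = parse_rss(SAMPLE_STATUS)
```

Tracking both values matters for the debugging described later in the thread: the peak only ever ratchets up, while the current RSS can also shrink, which localizes growth more precisely.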
Thanks @ndkeen ! Those are really clear visualizations. As far as I remember, CAM_run1 calls radiation (tphysbc), which calls these radiation diagnostics. So this confirms what we suspected: the diagnostics are causing this memory leak. Thanks @yfenganl for reporting on the ne120 grid. It might be faster to reproduce this using ne120, as it may already be using a lot of memory. |
FWIW, if I remove the fincl1 line in the user_nl_cam, I still see the same memory behavior. I also ran without fincl1 line and with only the first rad_diag_1 line in user_nl_cam. Also, I ran with DEBUG=TRUE. The job ran out of time, but after many steps there were no errors. |
Thank you all very much for taking a look! Is there any hope of identifying/fixing the bug soon? BTW @ndkeen , the fincl1 line is essential because otherwise the results from the radiation diagnostics are not written out. |
I spent more time trying to debug this. I don't have a fix, but I do have some more information. Using my own attempts to measure memory by placing calls within the code, I can narrow down where the memory (RSS) is growing/shrinking. I write the current RSS as well as the peak RSS. Originally I was tracking the peak, but this did not lead anywhere, as the leak can be elsewhere. Looking instead at the current RSS and tracking when it increases, I see that within the microp_tend code, the increase seems to happen in subroutine micro_mg_cam_tend(). Again, not on every call, but in a pattern. I have a little more detail inside of this routine, but it is quite large/complicated (well beyond the size that good SW engineering would suggest, but ...) and I was hoping someone more familiar with it could weigh in. Running the same case without the radiation diagnostics shows no increase in memory as described above (in fact, the memory is well-behaved across the day). The reason I wanted to try without the fincl1 line is not because I was suggesting that this could be a solution, but rather to debug: if we remove that line and memory does not increase, it could help narrow down the issue. When I tried it (earlier on), it seemed like the memory behavior was the same. I also tried the same test using valgrind. Below are notes regarding using valgrind: the run crashes/hangs with errors (output not shown).
Without valgrind, the run is fine, so I'm not sure what to make of this. I then tried with the GNU compiler and it completes 2 steps with output before timing out. However, the output does not contain the source code info (I did re-compile with -g hoping it would include it).
I tried these tests on anvil as well as cori-haswell, with both Intel and GNU. The versions of the compilers and valgrind on cori are more recent. |
Noting that #3866 might address this memory issue. I will test as soon as @singhbalwinder says it's ready. @ndkeen |
When I try @singhbalwinder 's branch in PR 3468 and use the same user_nl_cam above (renamed to user_nl_eam now), the memory appears almost constant after 3 days, whereas before, even with a master as of Sept 24th, the memory was increasing by at least 160 MB every day. So I think that PR will fix this issue. |
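The before/after comparison described here (roughly +160 MB/day versus almost flat) amounts to fitting a slope to RSS-versus-time samples. A sketch with illustrative numbers (not the actual run data):

```python
# Ordinary least-squares slope of y vs x, used here as MB per model day.
def slope(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

days = [0, 1, 2, 3]
leaking_rss = [1000, 1160, 1320, 1480]  # made-up: ~160 MB/day growth
fixed_rss = [1000, 1002, 999, 1001]     # made-up: roughly flat after the fix
```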
Thanks @ndkeen for testing it so quickly. I will make note in my PR that it fixes this memory issue. |
…1 into next (PR #3932) Fixes and enables radiation diagnostics
Radiation diagnostic calls are enabled so that aerosol species mentioned in the radiation diagnostics list (rad_diag_* in the atm_in file) can participate in all the same processes as the prognostic radiation calls (mentioned in the rad_climate list in atm_in). The missing processes for the diagnostic calls were:
Aerosol size adjustment
Aitken<->Accumulation aerosol transfer
Enabling these calculations ensures that a radiation diagnostic call with exactly the same species as the radiation climate call produces BFB diagnostic fields (issue #3468). Since radiation diagnostic lists (rad_diag_*) can exclude species or even an entire mode (or modes), I have relied on rad_cnst_* calls to get info about mode numbers, the number of species in a mode, and mode (or species) properties. rad_cnst_* calls guarantee that only species/modes present in the rad_diag_* lists are accessed. I have tested the following cases:
Excluding all aerosols
Excluding BC from all modes
Excluding SOA from all modes
Excluding the Aitken mode
Excluding the Accumulation modes
Radiation diagnostic list exactly the same as the radiation climate list
For diagnostic lists, the "Aerosol size adjustment" process is always ON, but "Aitken<->Accumulation transfer" is turned off for diagnostic calls where the Aitken or Accumulation mode is absent. I have reworked the mapping so that missing species in the Aitken and Accumulation modes are accounted for. The modal_aero_calcsize_sub subroutine is heavily refactored: different processes are split into their own routines (for readability) and similar calculations are combined. This code also fixes the memory leak issue mentioned in #2575. It also fixes another memory leak recently introduced by PR #3885 (thanks to Andrew Bradley for finding this!). This PR also cleans up logic for the clear_rh variables following Ben Hillman's suggestions.
Fixes #2575 Fixes #3468 [BFB] (for prognostic radiation calls; the answers will change for the diagnostic calls, as this PR fixes a bug identified in issue #3468)
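Because a diagnostic list may omit species or whole modes, indices into the climate list cannot be assumed identical to the diagnostic list's own ordering; they must be remapped. A hypothetical Python illustration of that idea (the species names and helper below are invented for illustration; the actual E3SM code does this through rad_cnst_* calls):

```python
# Hypothetical illustration of remapping a reduced diagnostic species list
# onto a full climate list. Species names are examples, not E3SM's data.
def map_diag_to_climate(diag, climate):
    # Every diagnostic species must exist in the climate list; a missing one
    # would indicate a namelist error, so .index() raising is acceptable here.
    return [climate.index(s) for s in diag]

climate_list = ["so4_a1", "pom_a1", "soa_a1", "bc_a1"]
diag_list = ["so4_a1", "soa_a1"]  # e.g. POM and BC excluded from diagnostics

index_map = map_diag_to_climate(diag_list, climate_list)
# Diagnostic species i should read climate field index_map[i], not field i.
```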
I have spent more than 2 weeks trying to figure out a model crash. Finally I was able to identify that the crash can be reproduced with current master without any code modification. I just checked out the latest master, made no change to the code, and enabled 10 radiation diagnostics for a run. The model ran for 1 month, wrote out h0 files, and ran a few more days and crashed with an error message stating “out of memory”. Could somebody please help?