Performance FLT
Attempts at running with se_nsplit=1 without increasing se_nu_top fail in the month of June. Adding vertical diffusion in the physics, similar to a sponge layer, makes the run stable.
Overall the model runs ~14% faster, which corresponds to about one more simulated year per day (SYPD).
Mean time (s)
Region                      ref.    opt.
========================================================
dyn_run                     9252    6366
prim_advance_exp            5282    4659   (horizontal dycore)
prim_advec_tracers_fvm      2742    1364   (CSLAM advection)
prim_advec_tracers_remap     588       0   (SE advection)
vertical_remap               197      96   (vertical remap)
Dycore runs 30% faster. CSLAM advection 2x faster.
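The two headline numbers can be recomputed from the table. This is a minimal sketch with the timings copied from above; the helper names are illustrative only and not part of any CAM tooling.

```python
# Mean timings (s) copied from the table above: (reference, optimized).
timings = {
    "dyn_run": (9252, 6366),
    "prim_advance_exp": (5282, 4659),
    "prim_advec_tracers_fvm": (2742, 1364),
    "vertical_remap": (197, 96),
}


def runtime_reduction(ref, opt):
    """Fraction of the reference runtime that was saved."""
    return (ref - opt) / ref


def speedup_factor(ref, opt):
    """How many times faster the optimized run is."""
    return ref / opt


for region, (ref, opt) in timings.items():
    print(f"{region:24s} {100 * runtime_reduction(ref, opt):5.1f}% less time, "
          f"{speedup_factor(ref, opt):.2f}x")
```

For dyn_run this gives about 31% less runtime, and for CSLAM advection a factor of about 2.0, consistent with the rounded figures quoted above.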
se_nsplit = 1
se_rsplit = 6
se_qsplit = 1
se_hypervis_subcycle = 1
se_nu_div = 1.0E15
se_nu = 1E15
se_sponge_del4_nu_div_fac = 1.0
se_sponge_del4_nu_fac = 1
se_sponge_del4_lev = 3
and source code modifications:
/glade/p/cesmdata/cseg/runs/cesm2_0/f.cam6_3_132.FLTHIST_ne30.opt.001/SourceMods/src.cam
In particular, in physics/vertical_diffusion.F90:
kvm(:,1) = 2E5_r8
kvm(:,2) = 1E5_r8
kvm(:,3) = 0.25E5_r8
Stability estimates for this setup:
Estimates for maximum stable and actual time-steps for different aspects of algorithm:
(assume max wind is 120.00000000m/s)
(assume max gravity wave speed is 342m/s)
* dt_dyn (time-stepping dycore ; u,v,T,dM) < 356.00s 300.00s
* dt_dyn_vis (hyperviscosity) ; u,v,T,dM) < 339.77s 300.00s
* dt_tracer_se (time-stepping tracers ; q ) < 308.89s 300.00s
* dt_tracer_vis (hyperviscosity tracers; q ) < 339.77s 300.00s
* dt_tracer_fvm (time-stepping tracers ; q ) < 1853.31s 1800.00
* dt_remap (vertical remap dt) 1800.00
* dt (del2 sponge ; u,v,T,dM) < 650.54s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 1653.12s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 4236.65s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 10114.09s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 21734.16s 300.00s
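To see how close each actual time-step sits to its estimated limit, the log values can be turned into relative margins. This is a rough sketch with the numbers copied from the estimates above; the function names are hypothetical.

```python
# (label, max stable dt [s], actual dt [s]) copied from the log above.
estimates = [
    ("dt_dyn", 356.00, 300.0),
    ("dt_dyn_vis", 339.77, 300.0),
    ("dt_tracer_se", 308.89, 300.0),
    ("dt_tracer_vis", 339.77, 300.0),
    ("dt_tracer_fvm", 1853.31, 1800.0),
]


def margin(dt_max, dt_actual):
    """Relative headroom: 0 means at the limit, negative means unstable."""
    return 1.0 - dt_actual / dt_max


def all_stable(rows):
    """True if every actual time-step is below its estimated limit."""
    return all(margin(dt_max, dt) > 0.0 for _, dt_max, dt in rows)


for label, dt_max, dt in estimates:
    print(f"{label:14s} margin = {100 * margin(dt_max, dt):4.1f}%")
print("all stable:", all_stable(estimates))
```

Both dt_tracer_se and dt_tracer_fvm come out within about 3% of their estimated limits, so the tracer time-steps leave the least headroom in this setup.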
se_nsplit = 2
se_rsplit = 3
se_qsplit = 1
se_hypervis_subcycle = 1
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac = 1
se_sponge_del4_nu_fac = 1
se_sponge_del4_lev = 1
se_hypervis_subcycle_sponge = 3
using cam6_3_132 and a SourceMod in prim_driver.F90:
> #ifdef trunk
292c292,306
<
---
> #else
> if (nsplit/=1) then
> call ApplyCAMForcing(elem,fvm,tl%n0,n0_qdp,dt_remap,dt_phys,nets,nete,nsubstep)
> end if
> call tot_energy_dyn(elem,fvm,nets,nete,tl%n0,n0_qdp,'dBD')
> do r=1,rsplit
> if (r.ne.1) call TimeLevel_update(tl,"leapfrog")
> !
> ! if nsplit==1 and physics time-step is long then there will be noise in the
> ! pressure field; hence "dripple" in tendencies
> !
> if (nsplit==1) call ApplyCAMForcing(elem,fvm,tl%n0,n0_qdp,dt,dt_phys,nets,nete,r)
> call prim_step(elem, fvm, hybrid,nets,nete, dt, tl, hvcoord,r)
> enddo
> #endif
557c571
< call Prim_Advec_Tracers_remap(elem, deriv,hvcoord,hybridnew,dt_q,tl,nets,nete)
---
> ! call Prim_Advec_Tracers_remap(elem, deriv,hvcoord,hybridnew,dt_q,tl,nets,nete)
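The control flow of the modified branch can be mimicked in a few lines to count how often the physics forcing is applied per physics time-step. This is only a sketch: the outer loop over nsplit is an assumption about how the caller invokes this routine, and count_calls is a hypothetical helper, not CAM code.

```python
def count_calls(nsplit, rsplit):
    """Count forcing applications and dynamics steps per physics time-step.

    Mirrors the if/do structure of the modified branch in the diff above.
    The outer loop over nsplit is assumed, not taken from the diff.
    """
    forcing_calls = 0
    prim_steps = 0
    for _ in range(nsplit):                 # assumed outer remap loop
        if nsplit != 1:
            forcing_calls += 1              # once per remap step, using dt_remap
        for _r in range(1, rsplit + 1):
            if nsplit == 1:
                forcing_calls += 1          # once per dynamics step, using dt
            prim_steps += 1
    return forcing_calls, prim_steps


print(count_calls(nsplit=2, rsplit=3))  # default splitting
print(count_calls(nsplit=1, rsplit=6))  # nsplit=1 configuration
```

Under these assumptions both configurations take six dynamics steps per physics time-step, but the forcing is applied twice with nsplit=2 and six times with nsplit=1.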
With some I/O the model runs 52s/day compared to baseline 68s/day on Derecho (~26% speed-up).
The 1-month average of OMEGA500 (month 3 of the run) shows much more structure in the sped-up version (left: optimized; right: baseline):
se_hypervis_subcycle = 3
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac = 3
se_sponge_del4_nu_fac = 3
se_sponge_del4_lev = 3
se_hypervis_subcycle_sponge = 3
unstable
se_nsplit = 1
se_rsplit = 6
se_qsplit = 1
se_hypervis_subcycle = 6
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac = 7.5
se_sponge_del4_nu_fac = 5
se_sponge_del4_lev = 3
se_hypervis_subcycle_sponge = 1
stable
Here is a breakdown of the dynamical core timings (normalized by total dynamical core timing):
Same but not normalized:
Almost as much time is spent advancing the dynamical core (prim_advance_exp) as in tracer advection (the number of advected species is 41).
prim_advance_exp is split roughly 50-50 between the inviscid solver and hyperdiffusion:
prim_advance_exp 900 900 36 0.9670 0.5139 300 1.4304 136
compute_and_apply_rhs 900 900 180 0.5040 0.2284 754 0.9490 63
advance_hypervis 900 900 36 0.4550 0.2523 272 0.6860 148
sponge_diff 900 900 36 0.0283 0.0051 743 0.0926 726
Default dynamics namelist settings are:
se_hypervis_subcycle = 4
se_hypervis_subcycle_q = 1
se_hypervis_subcycle_sponge = 1
se_large_courant_incr = .true.
se_limiter_option = 8
se_nsplit = 2
se_rsplit = 3
se_nu_top = 1.25e5
which gives:
dt_remap = 1800/se_nsplit = 900
dt_fvm = 1800/se_nsplit = 900
dt_dyn = dt_remap/se_rsplit = 300
dt_hypervis = dt_dyn/se_hypervis_subcycle = 75
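The four relations above are easy to evaluate for other namelist combinations. The sketch below simply encodes them; the function name and the dt_phys argument (the 1800 s base interval in the formulas) are labels chosen here for illustration.

```python
def se_time_steps(se_nsplit, se_rsplit, se_hypervis_subcycle, dt_phys=1800.0):
    """Evaluate the time-step relations listed above (all values in seconds)."""
    dt_remap = dt_phys / se_nsplit
    dt_fvm = dt_phys / se_nsplit
    dt_dyn = dt_remap / se_rsplit
    dt_hypervis = dt_dyn / se_hypervis_subcycle
    return {
        "dt_remap": dt_remap,
        "dt_fvm": dt_fvm,
        "dt_dyn": dt_dyn,
        "dt_hypervis": dt_hypervis,
    }


# Default settings: se_nsplit=2, se_rsplit=3, se_hypervis_subcycle=4
print(se_time_steps(2, 3, 4))
# An nsplit=1 variant: se_nsplit=1, se_rsplit=6, se_hypervis_subcycle=2
print(se_time_steps(1, 6, 2))
```

With se_nsplit=1 and se_rsplit=6 the dynamics step stays at 300 s while dt_remap and dt_fvm double to 1800 s, which is why the tracer (CSLAM) and remap costs drop in that configuration.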
There is increased del4 divergence damping in the sponge:
sponge_del4_nu_fac = 0.10E+01
sponge_del4_nu_div_fac = 0.45E+01
sponge_del4_lev = 3
which may not be necessary. By setting
sponge_del4_nu_div_fac = 0.10E+01
we can reduce se_hypervis_subcycle from 4 to 3. If we decrease se_nu_div from 2.5E15 to 2.0E15 we can reduce se_hypervis_subcycle to 2.
This leads to a slight speed-up (5% speed-up of the dycore; cam_run3).
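The subcycling counts quoted here can be reproduced with a back-of-the-envelope estimate. The sketch assumes that the maximum stable hyperviscosity time-step scales inversely with se_nu_div, calibrated on the logged estimate dt_dyn_vis < 339.77 s for se_nu_div = 1E15 reported elsewhere on this page; it ignores any extra sponge factor (that is, it assumes sponge_del4_nu_div_fac = 1). The function name is hypothetical.

```python
import math

DT_MAX_REF = 339.77   # s, logged hyperviscosity limit at the reference viscosity
NU_REF = 1.0e15       # se_nu_div for which DT_MAX_REF was logged


def min_hypervis_subcycle(nu_div, dt_dyn=300.0):
    """Smallest se_hypervis_subcycle keeping dt_dyn/subcycle within the limit.

    Assumes the stable limit is inversely proportional to nu_div.
    """
    dt_max = DT_MAX_REF * NU_REF / nu_div
    return math.ceil(dt_dyn / dt_max)


for nu_div in (2.5e15, 2.0e15, 1.0e15):
    print(f"se_nu_div = {nu_div:.1e} -> se_hypervis_subcycle >= "
          f"{min_hypervis_subcycle(nu_div)}")
```

Under this assumption, se_nu_div = 2.5E15 needs 3 subcycles and 2.0E15 needs 2, matching the reductions described above.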
With HB diff it is now possible to run with se_nsplit=1. Changing to nsplit=1 in the optimized se_hypervis_subcycle setup (described above) we get a ~35% speed-up of the dynamical core:
with the following stability estimates:
Estimates for maximum stable and actual time-steps for different aspects of algorithm:
(assume max wind is 120.00000000m/s)
(assume max gravity wave speed is 342m/s)
* dt_dyn (time-stepping dycore ; u,v,T,dM) < 356.00s 300.00s
* dt_dyn_vis (hyperviscosity) ; u,v,T,dM) < 169.89s 150.00s
* dt_tracer_se (time-stepping tracers ; q ) < 308.89s 300.00s
* dt_tracer_vis (hyperviscosity tracers; q ) < 339.77s 300.00s
* dt_tracer_fvm (time-stepping tracers ; q ) < 1853.31s 1800.00
* dt_remap (vertical remap dt) 1800.00
* dt (del2 sponge ; u,v,T,dM) < 650.54s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 1653.12s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 4236.65s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 10114.09s 300.00s
* dt (del2 sponge ; u,v,T,dM) < 21734.16s 300.00s
This does mean that wherever winds exceed 60m/s, CSLAM uses PCoM for the increment, which is less accurate.
se_nsplit = 1
se_rsplit = 6
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5
and applyCAMforcing twice during dynamics (same number of times as default but here in floating layer):
min/max OMEGA500: -0.64,0.46
se_nsplit = 1
se_rsplit = 6
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5
and applyCAMforcing rsplit number of times
min/max OMEGA500: -0.41,0.44
Note: noise goes away!
Run 3: nsplit=1, applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=2E15 (instead of 2.5E15)
Like the previous configuration, but the cost of hyperviscosity is reduced by not increasing div4 in the sponge and by slightly reducing nu_div everywhere:
se_nsplit = 1
se_rsplit = 6
se_sponge_del4_nu_div_fac = 1
se_hypervis_subcycle = 2
se_nu_div = 2.0E15
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5
min/max OMEGA500: -0.5,0.43
Run 4: nsplit=1 and applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=1E15 (instead of 2.5E15)
se_nsplit = 1
se_rsplit = 6
se_sponge_del4_nu_div_fac = 1
se_hypervis_subcycle = 1
se_nu_div = 1.0E15
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5
min/max OMEGA500: -0.41,0.41
Run 5: nsplit=1 and applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=1E15 (instead of 2.5E15), nu=1E15 (instead of 0.5E15)
min/max OMEGA500: -0.53,0.47
We need to check whether these optimizations carry over to FMT.
Computational performance
Performance comparison between run 1 and run 4 (reduction in runtime is ~33%):
Performance comparison between run 3 and run 4
Performance for Run 5 + keeping water tracers (except Q) fixed using hacked code (same array size but only advection and hypervis on Q; boundary exchange still full Q):
Vertical remap is not really sped up (even though we don't do vertical remapping of GLL water tracers except Q).
Consider a clean implementation:
- the hypervis buffer should only hold one species (Q)
- sum over fixed species does not need to be re-computed
Other questions:
- why is p_d_coupling so expensive (1.7s compared to 4.84s for the entire dynamical core):
stepon_run2 225 225 4 0.9117 0.2990 188 1.7555 223
p_d_coupling 225 225 4 0.9116 0.2989 188 1.7555 223
phys2dyn 225 225 4 0.5883 0.2131 188 1.5995 222
fvm:fill_halo_phys 225 225 4 0.4123 0.0392 188 1.4356 222
p_d_coupling:bndry_exchange 225 225 4 0.2484 0.0097 189 0.8556 52
pd_copy 225 225 4 0.0261 0.0202 218 0.0308 1
ionosphere_run2 225 225 4 0.0000 0.0000 26 0.0000 73
CAM_run3 225 225 4 4.0004 3.5746 175 4.8436 19
stepon_run3 225 225 4 4.0004 3.5746 175 4.8436 19
dyn_run 225 225 4 3.9385 3.5039 180 4.7863 19
prim_advec_tracers_fvm 225 225 4 1.6111 1.5291 169 1.6664 134
prim_advance_exp 225 225 24 1.4929 0.9390 180 2.4137 19
compute_and_apply_rhs 225 225 120 1.1714 0.6649 180 2.0794 19
advance_hypervis 225 225 24 0.2882 0.2400 36 0.4027 134
sponge_diff 225 225 24 0.0702 0.0317 180 0.1216 164
prim_advec_tracers_remap 225 225 24 0.4930 0.3768 167 0.6851 180
vertical_remap 225 225 4 0.1331 0.1240 222 0.1523 81
applyCAMforcing 225 225 24 0.1080 0.0591 222 0.1563 73
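To put the p_d_coupling question in numbers, a few rows of the timer dump can be parsed and compared. The column meanings are not stated on this page, so the sketch treats them as unnamed fields; the index used for the comparison is chosen only because it holds the 1.7555 and 4.8436 values quoted in the question above.

```python
# A few rows copied from the timer output above.
rows = """\
p_d_coupling 225 225 4 0.9116 0.2989 188 1.7555 223
CAM_run3 225 225 4 4.0004 3.5746 175 4.8436 19
dyn_run 225 225 4 3.9385 3.5039 180 4.7863 19
"""


def parse_timers(text):
    """Map timer name -> list of its eight numeric fields (meanings unlabeled)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # skip anything that is not a full timer row
        table[fields[0]] = [float(x) for x in fields[1:]]
    return table


timers = parse_timers(rows)
QUOTED = 6  # zero-based numeric field holding the values quoted in the text
ratio = timers["p_d_coupling"][QUOTED] / timers["CAM_run3"][QUOTED]
print(f"p_d_coupling / CAM_run3 = {ratio:.2f}")
```

With the quoted field the ratio is about 0.36; using the first timing field instead (index 3) gives about 0.23, so the size of the effect depends on which column is the relevant one.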