
Performance FLT


FLT configuration

Attempts at running with se_nsplit=1 without increasing se_nu_top fail in the month of June. Adding vertical diffusion in the physics, similar to the sponge layer, makes the run stable.

Performance: FLT baseline cam6_3_132 (using 2160 tasks on Derecho)

[Screenshot 2023-11-06 3:10 PM: timing comparison, reference vs. optimized]

Overall the model runs ~14% faster, i.e. roughly one more simulated year per day (SYPD).

Region                      ref mean (s)  opt mean (s)
========================================================
dyn_run                             9252          6366
   prim_advance_exp                 5282          4659   (horizontal dycore)
   prim_advec_tracers_fvm           2742          1364   (CSLAM advection)
   prim_advec_tracers_remap          588             0   (SE advection)
   vertical_remap                    197            96   (vertical remap)

The dycore runs ~30% faster and CSLAM advection ~2x faster.

Modifications

se_nsplit = 1
se_rsplit = 6
se_qsplit = 1

se_hypervis_subcycle = 1
se_nu_div = 1.0E15
se_nu = 1E15
se_sponge_del4_nu_div_fac  = 1.0
se_sponge_del4_nu_fac  = 1
se_sponge_del4_lev = 3

and source code modifications:

/glade/p/cesmdata/cseg/runs/cesm2_0/f.cam6_3_132.FLTHIST_ne30.opt.001/SourceMods/src.cam

In particular, in physics/vertical_diffusion.F90 (sponge-like vertical diffusion of momentum in the top three levels):

  kvm(:,1) = 2E5_r8     ! enhanced eddy diffusivity for momentum at the model top
  kvm(:,2) = 1E5_r8     ! tapering downward
  kvm(:,3) = 0.25E5_r8  ! below level 3 the standard kvm is used

Stability estimates for this setup:

Estimates for maximum stable and actual time-steps for different aspects of algorithm:
(assume max wind is 120.00000000m/s)
(assume max gravity wave speed is 342m/s)
 
* dt_dyn        (time-stepping dycore  ; u,v,T,dM) <     356.00s     300.00s
* dt_dyn_vis    (hyperviscosity        ; u,v,T,dM) <     339.77s     300.00s
* dt_tracer_se  (time-stepping tracers ; q       ) <     308.89s     300.00s
* dt_tracer_vis (hyperviscosity tracers; q       ) <     339.77s     300.00s
* dt_tracer_fvm (time-stepping tracers ; q       ) <    1853.31s    1800.00s
* dt_remap      (vertical remap dt               )                  1800.00s
* dt            (del2 sponge           ; u,v,T,dM) <     650.54s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <    1653.12s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <    4236.65s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <   10114.09s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <   21734.16s     300.00s
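
As a cross-check, a minimal sketch of how the actual time steps in the table follow from the namelist above, using the same relations as in the out-of-the-box breakdown further down and a 1800 s physics time step (the dt_tracer_fvm = dt_remap relation is inferred from the actual values in the table, not stated explicitly):

dt_remap      = 1800 s / se_nsplit            = 1800 s
dt_tracer_fvm = dt_remap                      = 1800 s
dt_dyn        = dt_remap / se_rsplit          =  300 s
dt_dyn_vis    = dt_dyn / se_hypervis_subcycle =  300 s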

Older experiments

se_nsplit = 2
se_rsplit = 3
se_qsplit = 1
se_hypervis_subcycle = 1
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac  = 1
se_sponge_del4_nu_fac  = 1
se_sponge_del4_lev = 1
se_hypervis_subcycle_sponge = 3

using cam6_3_132 and a SourceMod in prim_driver.F90:

> #ifdef trunk
292c292,306
< 
---
> #else
>     if (nsplit/=1) then
>       call ApplyCAMForcing(elem,fvm,tl%n0,n0_qdp,dt_remap,dt_phys,nets,nete,nsubstep)
>     end if
>     call tot_energy_dyn(elem,fvm,nets,nete,tl%n0,n0_qdp,'dBD')    
>     do r=1,rsplit
>       if (r.ne.1) call TimeLevel_update(tl,"leapfrog")
>       !
>       ! if nsplit==1 and physics time-step is long then there will be noise in the
>       ! pressure field; hence "dripple" in tendencies
>       !
>       if (nsplit==1) call ApplyCAMForcing(elem,fvm,tl%n0,n0_qdp,dt,dt_phys,nets,nete,r)
>       call prim_step(elem, fvm, hybrid,nets,nete, dt, tl, hvcoord,r)
>     enddo
> #endif
557c571
<       call Prim_Advec_Tracers_remap(elem, deriv,hvcoord,hybridnew,dt_q,tl,nets,nete)
---
> !      call Prim_Advec_Tracers_remap(elem, deriv,hvcoord,hybridnew,dt_q,tl,nets,nete)

With some I/O, the model runs at 52 s/day compared to 68 s/day for the baseline on Derecho (~26% speed-up).

The 1-month average of OMEGA500 (month 3 of the run) has much more structure in the sped-up version (left: optimized; right: baseline):

[Screenshot 2023-10-18 4:13 PM: OMEGA500 comparison]

se_hypervis_subcycle = 3
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac  = 3
se_sponge_del4_nu_fac  = 3
se_sponge_del4_lev = 3
se_hypervis_subcycle_sponge = 3

unstable

se_nsplit = 1
se_rsplit = 6
se_qsplit = 1

se_hypervis_subcycle = 6
se_nu_div = 1E15
se_nu = 1E15
se_sponge_del4_nu_div_fac  = 7.5
se_sponge_del4_nu_fac  = 5
se_sponge_del4_lev = 3
se_hypervis_subcycle_sponge = 1

stable

Performance of the out-of-the-box configuration

Here is a breakdown of the dynamical core timings (normalized by the total dynamical core timing):

[Screenshot 2023-07-18 2:55 PM: normalized dycore timings]

Same but not normalized:

[Screenshot 2023-07-18 3:08 PM: dycore timings, not normalized]

Almost as much time is spent advancing the dynamical core (prim_advance_exp) as in tracer advection (41 advected species).

prim_advance_exp is split roughly 50-50 between the inviscid solver and hyperdiffusion:

           prim_advance_exp                                             900    900    36       0.9670      0.5139      300     1.4304      136    
              compute_and_apply_rhs                                      900    900    180      0.5040      0.2284      754     0.9490      63     
              advance_hypervis                                           900    900    36       0.4550      0.2523      272     0.6860      148    
                sponge_diff                                              900    900    36       0.0283      0.0051      743     0.0926      726    

Default dynamics namelist settings are:

 se_hypervis_subcycle           =  4 
 se_hypervis_subcycle_q         =  1                 
 se_hypervis_subcycle_sponge    =  1  
 se_large_courant_incr          =  .true.  
 se_limiter_option              =  8
 se_nsplit                      =  2
 se_rsplit                      =  3 
 se_nu_top                      =  1.25e5 

These settings imply the following time steps (1800 s physics time step):

dt_remap = 1800/se_nsplit = 900
dt_fvm   = 1800/se_nsplit = 900
dt_dyn   = dt_remap/se_rsplit = 300
dt_hypervis = dt_dyn/se_hypervis_subcycle = 75

Possible optimization

There is increased del4 divergence damping in the sponge:

  sponge_del4_nu_fac     =  0.10E+01
  sponge_del4_nu_div_fac =  0.45E+01
  sponge_del4_lev        = 3

which may not be necessary. By setting

  sponge_del4_nu_div_fac =  0.10E+01

we can reduce se_hypervis_subcycle from 4 to 3. If we decrease se_nu_div from 2.5E15 to 2.0E15 we can reduce se_hypervis_subcycle to 2.
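
A minimal sketch of the corresponding namelist changes (using the se_* namelist names that appear elsewhere on this page; values as argued above):

  se_sponge_del4_nu_div_fac = 1.0
  se_hypervis_subcycle      = 3

or, additionally reducing divergence damping,

  se_nu_div                 = 2.0E15
  se_hypervis_subcycle      = 2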

[Screenshot 2023-07-18 4:40 PM]

This leads to a slight speed-up (~5% for the dycore; cam_run3).

With HB diff it is now possible to run with se_nsplit=1. Changing to se_nsplit=1 in the optimized se_hypervis_subcycle setup described above gives a ~35% speed-up of the dynamical core:

[Screenshot 2023-07-19 11:22 AM]

with the following stability estimates:

Estimates for maximum stable and actual time-steps for different aspects of algorithm:
(assume max wind is 120.00000000m/s)
(assume max gravity wave speed is 342m/s)
 
* dt_dyn        (time-stepping dycore  ; u,v,T,dM) <     356.00s     300.00s
* dt_dyn_vis    (hyperviscosity        ; u,v,T,dM) <     169.89s     150.00s
* dt_tracer_se  (time-stepping tracers ; q       ) <     308.89s     300.00s
* dt_tracer_vis (hyperviscosity tracers; q       ) <     339.77s     300.00s
* dt_tracer_fvm (time-stepping tracers ; q       ) <    1853.31s    1800.00s
* dt_remap      (vertical remap dt               )                  1800.00s
* dt            (del2 sponge           ; u,v,T,dM) <     650.54s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <    1653.12s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <    4236.65s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <   10114.09s     300.00s
* dt            (del2 sponge           ; u,v,T,dM) <   21734.16s     300.00s

It does mean that for winds above 60 m/s, CSLAM will use PCoM for the increment, which is less accurate.
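
A rough check of where that threshold comes from (assuming a CSLAM cell width of roughly 1 degree, i.e. about 111 km, for ne30; the grid spacing is an assumption, not stated above):

  60 m/s  * 1800 s = 108 km  (about one CSLAM cell,  Courant number ~1)
  120 m/s * 1800 s = 216 km  (about two CSLAM cells, consistent with dt_tracer_fvm < 1853 s above)

so winds above ~60 m/s correspond to a CSLAM Courant number above 1, which is presumably why the increment there falls back to the less accurate PCoM.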

1 month "noise" testing

Run 1: FLT out of the box; 1 month average OMEGA500 (no spin-up):

[Screenshot 2023-07-19 3:13 PM: OMEGA500, run 1]

nsplit=1 and applyCAMforcing same number of times as default

se_nsplit = 1
se_rsplit = 6
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5

and applyCAMforcing called twice during dynamics (the same number of times as the default, but here in the floating layer):

[Screenshot 2023-07-19 3:15 PM: OMEGA500]

min/max OMEGA500: -0.64, 0.46

Run 2: nsplit=1 and applyCAMforcing rsplit number of times

se_nsplit = 1
se_rsplit = 6
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5

and applyCAMforcing called rsplit times:

[Screenshot 2023-07-19 3:20 PM: OMEGA500, run 2]

min/max OMEGA500: -0.41, 0.44

Note: the noise goes away! (consistent with the comment in the prim_driver.F90 SourceMod above about noise in the pressure field when the forcing is applied all at once with a long time step)

Run 3: nsplit=1, applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=2E15 (instead of 2.5E15)

Like the previous configuration, but reduce the cost of hyperviscosity by not increasing del4 divergence damping in the sponge and by slightly reducing nu_div everywhere:

se_nsplit = 1
se_rsplit = 6
se_sponge_del4_nu_div_fac  = 1
se_hypervis_subcycle = 2

se_nu_div = 2.0E15
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5

[Screenshot 2023-07-19 3:26 PM: OMEGA500, run 3]

min/max OMEGA500: -0.5, 0.43

Run 4: nsplit=1 and applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=1E15 (instead of 2.5E15)

se_nsplit = 1
se_rsplit = 6
se_sponge_del4_nu_div_fac  = 1
se_hypervis_subcycle = 1

se_nu_div = 1.0E15
se_hypervis_subcycle_sponge = 2
se_nu_top = 5E5

[Screenshot 2023-07-19 3:36 PM: OMEGA500, run 4]

min/max OMEGA500: -0.41, 0.41

Run 5: nsplit=1 and applyCAMforcing rsplit number of times, no increased div4 in sponge, nu_div=1E15 (instead of 2.5E15), nu=1E15 (instead of 0.5E15)

[Screenshot 2023-07-19 4:26 PM: OMEGA500, run 5]

min/max OMEGA500: -0.53, 0.47

We need to check whether these optimizations carry over to FMT.

Computational performance

Performance comparison between run 1 and run 4 (reduction in runtime is ~33%):

[Screenshot 2023-07-19 4:05 PM: timing comparison, run 1 vs. run 4]

Performance comparison between run 3 and run 4:

[Screenshot 2023-07-19 4:10 PM: timing comparison, run 3 vs. run 4]

Performance for run 5 plus keeping the water tracers (except Q) fixed using hacked code (same array size, but advection and hyperviscosity only on Q; boundary exchange still on the full tracer array):

[Screenshot 2023-07-19 5:01 PM: timings, run 5 with fixed water tracers]

Vertical remap is not really sped up (even though we do not do vertical remapping of the GLL water tracers except Q).

Points to consider for a clean implementation:

  • only one hyperviscosity buffer should be needed
  • the sum over the fixed species does not need to be re-computed

Other questions:

  • why is p_d_coupling so expensive (1.7 s compared to 4.84 s for the entire dynamical core):
           stepon_run2                                                      225    225    4        0.9117      0.2990      188     1.7555      223    
              p_d_coupling                                                   225    225    4        0.9116      0.2989      188     1.7555      223    
                phys2dyn                                                     225    225    4        0.5883      0.2131      188     1.5995      222    
                  fvm:fill_halo_phys                                         225    225    4        0.4123      0.0392      188     1.4356      222    
                p_d_coupling:bndry_exchange                                  225    225    4        0.2484      0.0097      189     0.8556      52     
                pd_copy                                                      225    225    4        0.0261      0.0202      218     0.0308      1      
            ionosphere_run2                                                  225    225    4        0.0000      0.0000      26      0.0000      73     
          CAM_run3                                                           225    225    4        4.0004      3.5746      175     4.8436      19     
            stepon_run3                                                      225    225    4        4.0004      3.5746      175     4.8436      19     
              dyn_run                                                        225    225    4        3.9385      3.5039      180     4.7863      19     
                prim_advec_tracers_fvm                                       225    225    4        1.6111      1.5291      169     1.6664      134    
                prim_advance_exp                                             225    225    24       1.4929      0.9390      180     2.4137      19     
                  compute_and_apply_rhs                                      225    225    120      1.1714      0.6649      180     2.0794      19     
                  advance_hypervis                                           225    225    24       0.2882      0.2400      36      0.4027      134    
                    sponge_diff                                              225    225    24       0.0702      0.0317      180     0.1216      164    
                prim_advec_tracers_remap                                     225    225    24       0.4930      0.3768      167     0.6851      180    
                vertical_remap                                               225    225    4        0.1331      0.1240      222     0.1523      81     
                applyCAMforcing                                              225    225    24       0.1080      0.0591      222     0.1563      73