The model currently runs in under 2hr30m per model month on normalbw on the raijin system at NCI, but not consistently. We would like to improve the speed so that we can run 2 months for every PBS submit, which should improve throughput.
I have recompiled the MOM5 and CICE5 components on a Broadwell compute node so as to pick up the faster Broadwell-specific AVX2 instruction set. As the models use the -xHost flag during compilation, this should happen automatically when they are built on a Broadwell node.
I have put the compiled executables here:
/short/public/aph502/ACCESS-OM2_broadwell
Ruth also asked about what happens when the model crashes and the timestep has to be reduced to get past whatever caused the problem: that would mean going back to 1 month/submit.
Obviously we would like the model not to crash.
Is there any possibility of placing an upper limit on ice advection to prevent a crash?
Have we confirmed Ruth is using the latest and best bathymetry?
Can we change the CFL condition, or is this a hard-wired problem of the ice traversing an entire grid cell?
Would it help to increase ndtd?
Have we turned off mushy thermo? If we haven't, doing so now might give enough headroom on run time to increase ndtd.
Andrew replied to the questions (on a help ticket):
Thanks Aidan
Would it be better for this discussion to be a GitHub issue? e.g. COSIMA/access-om2#78
With RYF I was getting "bad departure points" crashes on 12 Aug, probably the same as Ruth's.
From my notes:
Warning: Departure points out of bounds in remap
my_task, i, j = 1179 17 62
dt, uvel, vvel = 300.000000000000 0.569499526003574
3.71828483289453
dpx, dpy = -170.849857801072 -1115.48544986836
HTN(i,j), HTN(i+1,j) = 4208.11316624741 4214.00348097445
HTE(i,j), HTE(i,j+1) = 1030.78296224141 1030.90528998208
istep1, my_task, iblk = 274579 1179 1
Global block: 1180
Global i and j: 1726 2671
remap transport: bad departure points
forrtl: error (78): process killed (SIGTERM)
Location (i,j)=(1726,2671) is in the Canadian Archipelago, apparently southeast of Victoria Island.
Seems to occur when advective velocity is excessive, carrying ice more than one grid cell (line 1563, subroutine departure_points).
vvel = 3.71828483289453 (m/s presumably) - seems pretty fast
yields dpy(i,j) = -dt*vvel(i,j) = -1115.48544986836 < -HTE = -1030.
HTE = 1030m is a very small grid spacing for the ocean - this is near the tripole. /g/data1/ua8/JRA55-do/RYF/v1-3/*.1984_1985.nc seems to show the passage of a low pressure system (~990hPa) on 11-12 Aug in that location, with strongish (~10m/s) winds swinging from southeasterly to southwesterly.
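To make the failure mode concrete, below is a minimal Fortran sketch of the kind of bound check that produces this abort, plugged with the numbers from the log above. It's an illustration only, not the actual CICE ice_transport_remap code (the log suggests the real check also involves the neighbouring edges HTN(i+1,j) and HTE(i,j+1)).

```fortran
! Minimal sketch (not the actual CICE source) of the departure-point check in
! remap transport. The departure displacement over one dynamics sub-step,
! dpx = -dt*uvel and dpy = -dt*vvel, must stay within the local cell edge
! lengths HTN (north edge, x-extent) and HTE (east edge, y-extent); otherwise
! ice would be carried more than one grid cell and the run aborts.
program departure_check
  implicit none
  real(kind=8) :: dt, uvel, vvel, dpx, dpy, HTN, HTE

  dt   = 300.0d0               ! dynamics sub-timestep (s), from the log
  uvel = 0.569499526003574d0   ! ice velocity components (m/s), from the log
  vvel = 3.71828483289453d0
  HTN  = 4208.11316624741d0    ! cell edge lengths (m) near the tripole, from the log
  HTE  = 1030.78296224141d0

  dpx = -dt*uvel               ! departure displacement (m)
  dpy = -dt*vvel

  if (abs(dpx) > HTN .or. abs(dpy) > HTE) then
    print *, 'Departure points out of bounds in remap'
    print *, 'dpx, dpy =', dpx, dpy   ! dpy = -1115.5 m, beyond -HTE = -1030.8 m
  end if
end program departure_check
```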
Is there any possibility of placing an upper limit on ice advection to prevent a crash?
This was discussed here: COSIMA/access-om2#78 (comment)
I'm not in favour of an unphysical velocity limit, particularly not the per-component limit currently in the code. It would mess up the velocity gradients, affecting divergence etc.
I also played around with increasing ocean-ice drag (dragio) but ended up using ndtd=3 instead to keep the physics consistent.
Have we confirmed Ruth is using the latest and best bathymetry?
She's not. Ruth is using
/short/public/access-om2/input_38570c62/mom_01deg/topog.nc
which is the same as my control RYF and IAF runs.
This is from 1 July and doesn't include the fixes to the terraces in shallow water (COSIMA/access-om2#99 (comment)).
There is also an even more recent bathymetry in which Nic removed seamounts north of Severny Island to improve MOM stability - discussion is on Slack.
Using either of these new bathymetries will require tweaking the restart files as new wet cells are added (at least in Nic's case, but probably both).
Can we change the CFL condition, or is this a hard-wired problem of the ice traversing an entire grid cell?
AFAIK it's a fundamental limit of the advection algorithm, unlike the MOM CFL check, so we can't change it.
Would it help to increase ndtd?
Yes, this is how we fixed it before. Ruth currently has ndtd=2, but the RYF control run uses 3. I've also been using ndtd=3 in the IAF run. But it will mess up the load balance as CICE will then run 1.5x slower. Actually I wonder whether this explains a lot of the efficiency of the minimal config...?
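For anyone wanting to try this, ndtd is set in the CICE namelist (ice_in). A minimal sketch follows; the namelist group name here is an assumption, so check where ndtd actually sits in the run's ice_in.

```fortran
! Sketch only: the group name is an assumption; check the run's ice_in.
! ndtd is the number of dynamics/transport subcycles per thermodynamic step,
! so the dynamics sub-step is dt/ndtd. Going from 2 to 3 shrinks the departure
! distance per sub-step by a third, at the cost of running the dynamics
! ~1.5x more often (the load-balance hit mentioned above).
&setup_nml
    ndtd = 3
/
```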
Have we turned off mushy thermo? If we haven't, doing so now might give enough headroom on run time to increase ndtd.
No it's still mushy (ktherm=2) and we probably can't go back to ktherm=1. I tried this back in Aug without success (got "ice: Vertical thermo error"). I suspect this is caused by the freezing point being incompatible between the two schemes, but was unable to be sure as a whole lot of error messages weren't being written. Let me know if you want all the details. This is one of the reasons I'd like to start the new run from WOA.
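For completeness, the thermodynamics scheme is also a namelist choice; a sketch of the current setting (group name again an assumption) is below.

```fortran
! Sketch only: the group name is an assumption; check the run's ice_in.
! ktherm = 2 selects mushy-layer thermodynamics; ktherm = 1 is the older
! Bitz and Lipscomb scheme, which (per the August test above) can't simply
! be switched to mid-run without hitting "ice: Vertical thermo error",
! possibly because the two schemes use incompatible freezing points.
&thermo_nml
    ktherm = 2
/
```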
cheers
Andrew