Improve run time and stability #1

Closed · aidanheerdegen opened this issue Nov 23, 2018 · 2 comments

@aidanheerdegen commented Nov 23, 2018

The model currently runs in under 2 hr 30 min per model month on normalbw on the raijin system at NCI, but not consistently. We would like to improve the speed so that 2 months can reliably fit in every PBS submit, which should improve throughput.

I have recompiled the MOM5 and CICE5 components on a Broadwell compute node so as to pick up the faster Broadwell-specific AVX2 instruction set. As the models use the -xHost flag during compilation, this should happen automatically.

I have put the compiled executables here:

/short/public/aph502/ACCESS-OM2_broadwell

Ruth also asked: what about when the model crashes and the timestep is reduced to get past whatever is causing the problem? That would mean going back to 1 month per submit.

Obviously we would like the model not to crash.

Is it possible to place an upper limit on ice advection to prevent a crash?

Have we confirmed Ruth is using the latest and best bathymetry?

Can we change the CFL condition, or is this a hard-wired problem of the ice traversing an entire grid cell?

Would it help to increase ndtd?

Have we turned off mushy thermo? If we haven't, doing so now might give enough headroom on run time to increase ndtd.

Andrew replied to the questions (on a help ticket):

Thanks Aidan
Would it be better for this discussion to be a GitHub issue? e.g. COSIMA/access-om2#78
With RYF I was getting "bad departure points" crashes on 12 Aug, probably the same as Ruth's.
From my notes:
Warning: Departure points out of bounds in remap
my_task, i, j = 1179 17 62
dt, uvel, vvel = 300.000000000000 0.569499526003574
3.71828483289453
dpx, dpy = -170.849857801072 -1115.48544986836
HTN(i,j), HTN(i+1,j) = 4208.11316624741 4214.00348097445
HTE(i,j), HTE(i,j+1) = 1030.78296224141 1030.90528998208
istep1, my_task, iblk = 274579 1179 1
Global block: 1180
Global i and j: 1726 2671
remap transport: bad departure points
forrtl: error (78): process killed (SIGTERM)
Location (i,j)=(1726,2671) is in the Canadian Archipelago, apparently southeast of Victoria Island.
Seems to occur when advective velocity is excessive, carrying ice more than one grid cell (line 1563, subroutine departure_points).
vvel = 3.71828483289453 (m/s presumably) - seems pretty fast
yields dpy(i,j) = -dt*vvel(i,j) = -1115.48544986836 < -HTE = -1030.
HTE = 1030m is a very small grid spacing for the ocean - this is near the tripole.
/g/data1/ua8/JRA55-do/RYF/v1-3/*.1984_1985.nc seems to show the passage of a low pressure system (~990hPa) on 11-12 Aug in that location, with strongish (~10m/s) winds swinging from southeasterly to southwesterly.
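To spell out the check that fails (this is my paraphrase of the remap constraint, not the exact code; names as in the log above):

dpy = -dt * vvel = -300.0 * 3.71828... ~ -1115 m   (departure distance in y)
required: |dpy| < HTE ~ 1031 m                     (violated here, hence "bad departure points")

As far as I can tell the dt seen by the transport scheme is the thermodynamic timestep divided by ndtd, so increasing ndtd shortens this advective sub-step and gives more headroom against the limit (see the ndtd question below).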

Is it possible to place an upper limit on ice advection to prevent a crash?

This was discussed here:
COSIMA/access-om2#78 (comment)
I'm not in favour of an unphysical velocity limit, particularly not the per-component limit currently in the code. It would mess up the velocity gradients, affecting divergence etc.
I also played around with increasing ocean-ice drag (dragio) but ended up using ndtd=3 instead to keep the physics consistent.

Have we confirmed Ruth is using the latest and best bathymetry?

She's not. Ruth is using
/short/public/access-om2/input_38570c62/mom_01deg/topog.nc
which is the same as my control RYF and IAF runs.
This is from 1 July and doesn't include fixes to the terraces in shallow water
COSIMA/access-om2#99 (comment)
There is also an even more recent bathymetry in which Nic removed seamounts north of Severny Island to improve MOM stability - discussion is on Slack.
Using either of these new bathymetries will require tweaking the restart files as new wet cells are added (at least in Nic's case, but probably both).

Can we change the CFL condition, or is this a hard-wired problem of the ice traversing an entire grid cell?

AFAIK it's a fundamental limit of the advection algorithm, unlike the MOM CFL check, so we can't change it.

Would it help to increase ndtd?

Yes, this is how we fixed it before. Ruth currently has ndtd=2, but the RYF control run uses 3. I've also been using ndtd=3 in the IAF run. But it will mess up the load balance as CICE will then run 1.5x slower. Actually I wonder whether this explains a lot of the efficiency of the minimal config...?
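For reference, this is just a namelist change: ndtd lives in the CICE setup namelist (cice_in.nml in the control directory, if I remember the layout right). A sketch of the change only, everything else untouched:

&setup_nml
    ndtd = 3    ! number of dynamics/advection sub-steps per thermodynamic timestep (Ruth currently has 2)
    ...
/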

Have we turned off mushy thermo? If we haven't, doing so now might give enough headroom on run time to increase ndtd.

No, it's still mushy (ktherm=2), and we probably can't go back to ktherm=1. I tried this back in Aug without success (got "ice: Vertical thermo error"). I suspect this is caused by the freezing point being incompatible between the two schemes, but I was unable to confirm it because a whole lot of error messages weren't being written. Let me know if you want all the details. This is one of the reasons I'd like to start the new run from WOA.
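For completeness: the thermodynamics scheme is selected by ktherm in the CICE thermo namelist (2 = mushy, 1 = the older Bitz-Lipscomb scheme). Sketch only, group name as in CICE5:

&thermo_nml
    ktherm = 2    ! 2 = mushy-layer thermo (current); 1 = Bitz-Lipscomb 1999
    ...
/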
cheers
Andrew

@nichannah (Contributor) commented Nov 28, 2018

I'm in the process of helping Ruth to update to the new bathymetry.

I'll look at changing to ndtd=3 as the default in the minimal experiments.

@aekiss (Contributor) commented Aug 13, 2019

Can we close this issue?

The new configuration runs much faster, and more stably and consistently.

@aekiss closed this as completed on Oct 17, 2019