-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cacluated PIO stride and numtasks not correct for varying threading on Titan #1427
Comments
Is this a behaviour that has changed in the latest CIME? |
The old CIME did not have explicit lists of PIO strides and numtasks, correct? The calculation was done elsewhere. Perhaps this has always been broken. I'll run with an old version and check what the model log says was being done. |
Okay ... CIME 5.2 also had this list, but the stride was set at $PES_PER_NODE, so if a component was threaded, than this was one PIO task every few nodes, not every node. This was an "error" in the opposite direction than in CIME5.3, and there was never more than one PIO task per node (by default). I prefer this? I'll look at old versions next (if I can find anything). |
Thanks @worleyph ! It would really help if you can get the behaviour pre-CIME-5.2. |
CIME5.1 also had the default of
|
CIME2 did things differently - in one example stride was 4 even though this was two pio tasks for the 2-way threaded components and four pio tasks for the MPI-only OCN (on its own nodes). So, looks like the focus was on the number of pio tasks rather than on stride? In the "old days" I never trusted PIO strides and numiotasks to be set correctly, so always manually set something reasonable before running. The new logic generating a reasonable default is better, but I think that CIME5.3 needs to switch back to the CIME5.2 decision (use $PES_PER_NODE) or calculate the stride that guarantees no more than one pio task per node for all components, which is only an issue on Titan. @mt5555 also has (always) preferred a max number of PIO tasks, but that tends to be system and resolution specific. Note that pio_root being set to 1 needs to be evaluated. This was justified a long time ago (except that POP always required it to be 0). I haven't seen a study showing that this is worthwhile now (and potentially has issues when hyperthreading, and also ensures that the default timing files always miss the true I/O cost). |
This is an issue on Mira too. This line appears to be incorrect: https://github.com/ACME-Climate/ACME/blob/master/cime/scripts/lib/CIME/case_setup.py#L178. The stride should be coming from machine-specific settings, instead of $PES_PER_NODE: https://github.com/ACME-Climate/ACME/blob/master/cime/config/acme/machines/config_pio.xml#L25. |
Correct a budget diagnostic printed in the coupler log files. The fresh water budget was missing a freshwater term associated the salt flux due to ice melt/growth. This salt flux should be represented as an additional freshwater flux in the coupler water budget. The complete equation for net surface forcing of the POP model is PREC_F+EVAP_F+ROFF_F+IOFF_F+MELT_F+SALT_F*(sflux_factor/salinity_factor) + QFLUX/lh_fusion1e4s_scaling The freshwater representation of the SALT_F term was missing from the coupler budget. I've added that term and verified that the net ocean fresh water budget better represents the freshwater tendency calculated by the ocean component. I doubled checked the net budget against the global salinity trend in a model run. The new term in the table for the missing flux is weqsalt which stands for the fresh water equivalent of the salt flux. Test suite: cime regression tests and a coupled run where I backed out the freshwater flux from the salinity timeseries. Test baseline: cime_master Test namelist changes: Test status: [bit for bit, roundoff, climate changing] Fixes #1429 Code review: Tony, Rob, and Mariana.
On TItan, the PIO_STRIDE is now $PES_PER_NODE in all cases, same as in CIME5.2 and CIME5.1 . I'm okay with this, but will wait to hear from @amametjanov and @jayeshkrishna before closing this. Has something special been done for Mira? Or is there a more general fix in the pipe. Based on @amametjanov 's last comment, Az does not like using $PES_PER_NODE on Mira, at least. |
We just updated CIME on master so might be worth checking again. |
That is what I was testing - master updated with the latest CIME. |
Ok. |
Just looked at this on Cori-KNL, Cori-Haswell, and Edison. For --compset A_WCYCL1850S --res ne30_oECv3_ICG and the default pe layout: Cori-KNL has PIO_STRIDE == 64 |
Hi Pat. PIO_STRIDE stayed at 128 with the new CIME on Cetus. This is okay. The strides are being set from here, which is also setting Cori-KNL's stride to 64. |
Hi Az, Thanks. If things are working as they are supposed to and if you are happy with this, could you please close this issue? |
Thanks. |
I need to revisit pio stride default on cori-knl |
1 similar comment
I need to revisit pio stride default on cori-knl |
Correct a budget diagnostic printed in the coupler log files. The fresh water budget was missing a freshwater term associated the salt flux due to ice melt/growth. This salt flux should be represented as an additional freshwater flux in the coupler water budget. The complete equation for net surface forcing of the POP model is PREC_F+EVAP_F+ROFF_F+IOFF_F+MELT_F+SALT_F*(sflux_factor/salinity_factor) + QFLUX/lh_fusion1e4s_scaling The freshwater representation of the SALT_F term was missing from the coupler budget. I've added that term and verified that the net ocean fresh water budget better represents the freshwater tendency calculated by the ocean component. I doubled checked the net budget against the global salinity trend in a model run. The new term in the table for the missing flux is weqsalt which stands for the fresh water equivalent of the salt flux. Test suite: cime regression tests and a coupled run where I backed out the freshwater flux from the salinity timeseries. Test baseline: cime_master Test namelist changes: Test status: [bit for bit, roundoff, climate changing] Fixes #1429 Code review: Tony, Rob, and Mariana.
a) For env_mach_pes.xml with (among other components):
the PIO settings are
b) If this is instead changed to
then the PIO settings in env_run.xml are now
So, even though OCN is pure MPI in both cases, the stride is 16 in the first and 8 in the second. This is appropriate on systems at NERSC and the ALCF, but not on Titan.
The text was updated successfully, but these errors were encountered: