floating overflow with T62_oQU240.GMPAS-IAF on cori-knl w debug intel #1309
Comments
Did they fail or time out?
Looks like they aborted. I don't see any mention of them hitting the time limit.
Can you point me at the case and run directories?
and
What codebase are you using? Something isn't adding up...
Master from a few weeks ago. Should I update?
Maybe so. It looks like your codebase is inconsistent with the scripts that build the mpas namelists. The ocn logs have errors that look like:
You could update to tag v1.0.0-beta.1, which is right before the cime5.2 update.
I always update submodules. Note that this is working with 60to30, for what it's worth.
I don't know. It just confuses me that you're getting these error messages and I don't see any reference to that config flag anywhere in your codebase... So maybe try updating to the hash that Rob suggested and see if we still get this?
I just encountered an error similar to what @ndkeen has seen here, on edison in EC60to30v3 with cime5.2, but it doesn't seem to be crashing my run.
The error with the config flag?
Yes. Here is the tail of ocn.log
I had a more recent master available (March 9th) and fired off another 1-node test. I get a slightly different error.
And just to clarify, if I add: I also tried turning off those glob stats for the March 9th master, and I still get the same error as above. So two different errors, I presume.
@ndkeen, on the second error of your first post, using one node and two threads, you have an error:
This is here:
I bet the problem on that one is that
That would be a safe change to make regardless, and I can do that.
Mark: just looking through old issues. I guess we are just waiting for your change to propagate to ACME?
@ndkeen - the changes should have propagated to ACME in May. Do you want to retest? Or close?
Yes, the change is in ACME, and takes care of the
The create_newcase command is:
create_newcase -case /global/cscratch1/sd/ndk/acme_scratch/SMS.T62_oQU240.GMPAS-IAF.cori-knl_intel.m27n01t02debugstats -res T62_oQU240 -mach cori-knl -compiler intel -compset GMPAS-IAF -project acme --walltime=00:30:00
The run completes 5 days in optimized builds (without DEBUG=TRUE) at about 16 SYPD.
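For a rough sense of what 16 SYPD means for a 5-day run, here is a minimal sketch of the arithmetic. It assumes SYPD is simulated years per wall-clock day and a 365-day calendar, and it ignores initialization and queue time. The function name `estimate_wall_seconds` is made up for illustration and is not part of CIME or ACME.

```shell
# Hypothetical helper: estimate wall-clock seconds for a run.
#   $1 = simulated days, $2 = throughput in SYPD (simulated years per day)
# Assumes a 365-day calendar; ignores init and queue time.
estimate_wall_seconds() {
  awk -v d="$1" -v s="$2" 'BEGIN { printf "%.0f\n", d / 365 / s * 86400 }'
}

# 5 simulated days at about 16 SYPD
estimate_wall_seconds 5 16
```

Under those assumptions the time-stepping part of the optimized run should take on the order of a minute, so a 00:30:00 walltime leaves plenty of room for the much slower DEBUG=TRUE build.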
For the following, I placed all components on the same node with 64 MPI ranks and adjusted the PIO stride. This is with 2 threads.
I ran the same thing again with 1 thread and got a different failure: