UFSATM failure with Intel with debug flags #58
Comments
@rsdunlapiv I believe that is just the place where an un-physical temperature triggers the array index to go out of bounds (it is associated with the calculation of the saturation specific humidity.) |
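To illustrate the failure mode described in this comment, here is a minimal Python sketch (not the model's Fortran code) of a table-based saturation vapor pressure lookup. The table range `T_MIN`/`T_MAX`, the step `DT`, and the Magnus-type formula used to fill the table are all assumptions chosen for illustration. The only point is that an un-physical temperature maps to an index outside the table, which a debug build with bounds checking will trap.

```python
import math

# Hypothetical table bounds and resolution (assumptions, not the model's values).
T_MIN = 180.0  # K
T_MAX = 330.0  # K
DT = 0.1       # K
N = int(round((T_MAX - T_MIN) / DT)) + 1


def es_magnus(t_kelvin):
    """Saturation vapor pressure over water in Pa (Magnus-type approximation)."""
    t_c = t_kelvin - 273.15
    return 610.94 * math.exp(17.625 * t_c / (t_c + 243.04))


# Precompute the table once, as a table-driven scheme would.
TABLE = [es_magnus(T_MIN + i * DT) for i in range(N)]


def table_index(t_kelvin):
    """Map a temperature to a table index; no range check is done here."""
    return int((t_kelvin - T_MIN) / DT)


def es_lookup(t_kelvin):
    """Look up saturation vapor pressure, failing loudly on un-physical input."""
    i = table_index(t_kelvin)
    if not 0 <= i < N:
        raise IndexError(
            "temperature %.2f K maps to index %d, outside 0..%d" % (t_kelvin, i, N - 1)
        )
    return TABLE[i]
```

With these assumed bounds, `es_lookup(300.0)` returns a value from the table, while `es_lookup(50.0)` raises `IndexError`. In an optimized build without bounds checking, the equivalent access would silently read outside the array instead.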
Dusan merged an update to the ufs_public_release earlier today, part of the update was to address regression test failures in debug mode and to enable those tests for both 15p2 and 16beta as standard regression tests. These tests passed on Cheyenne with Intel and GNU and on Hera with Intel; they are based on the C96 configurations. See ufs-community/ufs-weather-model#25 and https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf. Many questions: resolution, setup (namelist etc.), initial conditions? Can you point me to the run directory on Cheyenne, please? |
Some of these questions are answered in the cime test name: SMS_D.C96.GFSv15p2.cheyenne_intel Initial conditions are 2019-09-09 00 |
I am using Which is the latest available. |
I get a similar error on stampede using v16beta at C96 resolution: |
On stampede using v15p2 at C96: |
I am at AMS this week and don't have much time to look into this. The easiest way forward imo is to compare the run directory (input files, namelist, ...) of your CIME setup to the ufs-weather-model regression test setup (which uses rt.sh to run and which completes successfully) on Cheyenne using the Intel compiler. I can point you to a directory containing a successful run if that helps. |
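A comparison like the one suggested above could be sketched as follows in Python, using only the standard library. The function name `compare_run_dirs` is hypothetical, and the sketch only looks at the top level of each directory; it is not part of CIME or rt.sh.

```python
import filecmp


def compare_run_dirs(dir_a, dir_b):
    """Summarize top-level differences between two run directories.

    Returns the file names found only in one directory and the common
    files whose contents differ.
    """
    cmp = filecmp.dircmp(dir_a, dir_b)
    # shallow=False forces a content comparison instead of relying on
    # file size and modification time.
    _, mismatch, errors = filecmp.cmpfiles(
        dir_a, dir_b, cmp.common_files, shallow=False
    )
    return {
        "only_in_a": sorted(cmp.left_only),
        "only_in_b": sorted(cmp.right_only),
        "differ": sorted(mismatch + errors),
    }
```

Calling it with the CIME run directory and the regression test run directory would give a short list of files (for example namelists) worth diffing by hand.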
Please point me to a successful run with debug flags enabled and I will compare. |
Jim, see
These are C96 test cases as in your CIME setup, and both run to completion for a 6h forecast when the model is compiled with |
@pjpegion I am ready to enlist your help. Instructions for running the tests on cheyenne are here: |
@llpcarson @JulieSchramm Let's run this test on Cheyenne and use it to review (and update, if needed) the Weather Model User's Guide on the directory structure and lists of input/output files. Keep in mind that this run uses CIME; the WM UG should be relevant to those using CIME as well as to those running the model in other ways. |
Did the comparison with the run directories that I gave @jedwards4b lead to any insight? I am not sure it makes sense to have more people try to run and debug this unless we understand why the regression tests in the ufs-weather-model run to completion in DEBUG mode while the CIME runs don't. |
I found a couple of differences that I didn't understand and tried changing my values to yours - it didn't make any difference. It could just be due to the different initial conditions. Or it could be due to different build flags - but I didn't see any build output in the directory you pointed me to. I think that it does make sense to have @pjpegion and @ligiabernardet and others become familiar with cime build and testing even if it doesn't lead to any insights regarding the test failure. |
Ok, thanks for the info. I am happy to take a look as well.
|
@jedwards4b I'm following your instructions and I ran into two problems so far. and there is more info in /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/TestStatus.log |
@pjpegion This looks like a python version issue - what python are you using? |
I think that @uturuncoglu is using 2.7.13 and hasn't tested with python3 yet. I will fix it, but in the meantime could you please try with the default python on cheyenne? |
I see my python is defaulting to /glade/u/home/pegion/miniconda3/bin/python |
This release is only compatible with Python 2.7.x (also because CCPP works only with those versions). |
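One way a script could fail early with a clear message, rather than with an obscure error later, is a version guard along these lines. This is a sketch only: the function name `is_supported_python` is hypothetical, and the 2.7-only rule simply restates the constraint given in this comment.

```python
import sys


def is_supported_python(version_info=None):
    """Return True if the interpreter is Python 2.7.x.

    The 2.7-only rule mirrors the constraint stated in this thread;
    it is an assumption of this sketch, not a check taken from the
    release scripts.
    """
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) == (2, 7)


def require_supported_python():
    """Exit with a readable message when the interpreter is not supported."""
    if not is_supported_python():
        sys.exit(
            "Python 2.7.x is required, found %d.%d"
            % (sys.version_info[0], sys.version_info[1])
        )
```

A script would call `require_supported_python()` before doing any other work, so that a user whose default interpreter is a miniconda Python 3 sees the reason immediately.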
Python 2.7 is getting deprecated this year. Does it make sense to limit to
an unsupported version of Python? We are moving to Python 3 everywhere.
Arun Chawla
|
For the next anticipated release of the ufs (with SAR etc) later this year we will hopefully be able to support Python 3. I don't see a chance to rewrite the code to work with Python 3, and what is more it will probably take years for Python 2.7 to completely disappear from HPCs and standard OS installations. |
CIME is fully compatible with and tested with python 3.6 as well as python 2.7. The fv3_interface issue should be easy to fix. |
@jedwards4b The model now builds and the run starts, but it crashes in initialization; the log file is |
Can someone point me to the job submission script for this job, please? Thanks ... |
@climbfuji I am running out of /glade/work/pegion/UFS/ufs-mrweather-app/cime/scripts |
Thanks, but I don't know how to find the actual job submission script (the file that contains the #PBS configuration entries and the mpiexec_mpt calls) from there. Maybe the CIME folks can help? We should always write/copy this job submission script into the run directory using a filename like job_card, because many developers who are used to rerunning some of the steps manually will want it. It is also good for documentation purposes, in my opinion. |
@pjpegion Now we are in the same place - I am trying to understand and fix this failure.
|
If I had to guess, I would say initial conditions. It's the first time it is calling the saturation adjustment as part of the dynamics before doing any physics, i.e. right after reading the initial conditions. I am downloading the run dirs for my rt.sh run and your cime run to my laptop to take a closer look at the diffs. |
@climbfuji The job submission script is in the case directory: If you want to see what the script will submit you would run By default we will submit the chgres and then the model - if you only want to submit the model use |
@climbfuji please point me to your build log - I want to confirm that we are using the same flags to build ccpp. |
/glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-public-release-20200114/tests/log_cheyenne.intel/compile_2.log is the log for the debug tests (compile step) |
I did find a problem with the build and am working on it, but I don't think that it is related to this run failure and agree that there seems to be a problem with initial conditions. |
It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places: |
I was able to run to completion by using the initial conditions in This points to a problem in chgres or in the initial condition files themselves. I'm not sure where to go from here. @uturuncoglu @climbfuji |
Dusan had the same error a couple months ago. It was traced to ice concentrations greater than 1.0 (such as 1.0000000000004) in the initial surface file from chgres. A fix was added. Can you merge the latest chgres from 'develop' to your branch? |
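As an illustration of the kind of check that catches this, here is a small Python sketch that clamps fractional values which exceed the physical range [0, 1] only by round-off and reports values that are clearly out of range. The function name `check_fraction` and the tolerance are assumptions; this is not the actual fix that went into chgres.

```python
def check_fraction(values, tol=1.0e-6):
    """Clamp round-off overshoots of a 0..1 fraction and flag real outliers.

    Values within `tol` of the valid range are clamped to [0, 1].
    Values further outside are left unchanged and their indices are
    returned so they can be investigated.
    """
    cleaned = []
    bad_indices = []
    for i, v in enumerate(values):
        if v < -tol or v > 1.0 + tol:
            bad_indices.append(i)
            cleaned.append(v)
        else:
            cleaned.append(min(1.0, max(0.0, v)))
    return cleaned, bad_indices
```

For example, a concentration of 1.0000000000004 would be clamped to 1.0, while a value such as 1.2 would be reported rather than silently altered.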
@arunchawla-NOAA I have opened issue NOAA-EMC/NCEPLIBS#21 but I am not sure who to assign. |
We received the following error running UFSATM with Intel 19/MPT on Cheyenne with physics GFSv15p2. We are running in debug mode and a SIGFPE was caught inside the physics (see stack trace below).
CIME test:
SMS_D.C96.GFSv15p2.cheyenne_intel
Modules:
Hash of UFS weather model:
ufs-community/ufs-weather-model@bde62f9
We can provide more information, as needed, on the initial conditions.
Stack trace: