RT atm_ds2s_docn_dice failing on S4 and Jet #2385
Comments
Thanks for this issue, @InnocentSouopgui-NOAA. This test depends on cpld_control_nowave_noaero_p8, and that test fails first (<...>/FV3_RT/rt_3364114/cpld_control_nowave_noaero_p8_intel/err). The PET000.ESMF_LogFile shows the error "UFSDriver.F90:543 Not valid - No component mom6 found". I don't see where you have your ufs-weather-model code. May I check that you ran (something like):
This last command may have been skipped. |
My working copy is at /mnt/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/RT/weekly_20240729. I use |
I'm struggling to reproduce your error. Can you check if these steps also work for you on Jet
where my_rt.conf is
Tests pass for me, with |
It's possible that I don't think this is related to the atm_ds2s_docn_dice test though |
@InnocentSouopgui-NOAA I think I can close this. Please feel free to re-open or reach out if I can help, especially for this test. For adding this to global-workflow, I'll mention that there are several options to generate the cplhist files for input via CDEPS. Happy to discuss if you would like |
I followed those steps on both Jet and S4. |
Can you post any error information from S4? Were you able to run without the "-c" flag? |
Below are the error messages (output of
Looking into run_atm_ds2s_docn_dice_intel.log, I see the following error.
If you can point me to where to adapt this for S4, I will work on it. |
Ah, ok. So I don't have access to S4, but this line: This may not work; on WCOSS2 something similar failed because of conflicting modules. |
Please keep the issue open. Thanks again. |
Sure. Sorry for the trouble. FYI, we expect to remove the nco/ncrcat usage there soon anyway |
On S4
|
Was cpld_control_nowave_noaero_p8 running on S4 before? Can you check if there is any more error information in FV3_RT/rt_334524/cpld_control_nowave_noaero_p8_intel/PET*.ESMF_LogFile or ufs-weather-model/tests/logs/logs_S4/ ? |
Yes. As for the one I am trying now, there is no error in PET*.ESMF_LogFile or ufs-weather-model/tests/logs/logs_S4 ... FV3_RT/rt_334524/cpld_control_nowave_noaero_p8_intel/out
|
I'm confused...from 4 days ago I understood that cpld_control_nowave_noaero_p8_intel runs on S4 but atm_ds2s_docn_dice failed as nco/ncrcat didn't module load properly. Now, cpld_control_nowave_noaero_p8_intel compiles but does not run and you cannot find any errors. I would expect some error logged in a PET_{150-190}.ESMF_LogFile. Am I following correctly? It is difficult to help without error/log information |
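As a quick way to check for logged errors across many PET files, here is a minimal Python sketch. It assumes the run directory holds files named like PET000.ESMF_LogFile (as quoted earlier in this thread) and that ESMF marks error entries with the word ERROR. The function name `find_esmf_errors` and the `pet_range` parameter are made up for illustration and are not part of ufs-weather-model or ESMF.

```python
import re
from pathlib import Path


def find_esmf_errors(run_dir, pet_range=None):
    """Collect lines containing ERROR from PET*.ESMF_LogFile files.

    run_dir   -- directory holding the PET log files (assumed layout)
    pet_range -- optional (first, last) PET numbers to restrict the scan
    Returns a dict mapping log file name to its list of ERROR lines.
    """
    hits = {}
    for log in sorted(Path(run_dir).glob("PET*.ESMF_LogFile")):
        match = re.match(r"PET(\d+)\.ESMF_LogFile$", log.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        pet = int(match.group(1))
        if pet_range is not None and not (pet_range[0] <= pet <= pet_range[1]):
            continue
        text = log.read_text(errors="replace")
        # Assumption: error entries carry the literal word ERROR.
        error_lines = [ln for ln in text.splitlines() if re.search(r"\bERROR\b", ln)]
        if error_lines:
            hits[log.name] = error_lines
    return hits
```

For the case discussed here, one might call `find_esmf_errors(run_dir, pet_range=(150, 190))` with `run_dir` set to the test's run directory. An empty result would mean no PET log in that range contains an ERROR entry, assuming the marker above is right.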
It might be worth trying to run a DEBUG build first. Add |
Yes, you are right, there is some confusion. I hope that clears up the confusion a little. |
Thanks for looking deeper, @InnocentSouopgui-NOAA. Then, without more error information, would it be possible for you to run cpld_control_nowave_noaero_p8_intel on S4 with DEBUG as Dusan suggested from current ufs-weather-model/develop? |
I ran Possibly Optimization is not working well on S4 for this case. |
Great that both ran! Can I ask two questions first:
|
Yes, with the exact same clone, compiling
Yes, I created baselines for all other tests in rt.conf last week. The failure of |
Can we try to dig into why the cpld_control_nowave_noaero_p8_intel (without DEBUG) is failing? Sorry that I don't have access to S4 to run tests myself. The MPI_Abort (on tasks 150-190) is usually done by ESMF. If so, there must be some log info printed in the PET files. Can you We should ID which component is failing. In run_dir/cpld_control_nowave_noaero_p8_intel/ufs.configure, you can find the component(s) for these 150-190 tasks/PETs |
I think it would also be useful to know exactly which component runs on the tasks which are showing the MPI abort. I'm guessing it is MOM6, but can you post your ufs.configure for this test? |
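To map the aborting tasks to components, the `*_model` and `*_petlist_bounds` entries in ufs.configure can be read programmatically. Below is a minimal Python sketch of that lookup. The sample text uses made-up toy PET counts, not the layout from this test, and the helper names `parse_petlist_bounds` and `components_for_pets` are hypothetical.

```python
import re

# Toy example only: these bounds are invented for illustration and are
# NOT the values from the ufs.configure discussed in this issue.
SAMPLE = """
# ATM #
ATM_model:                      fv3
ATM_petlist_bounds:             0 11
# OCN #
OCN_model:                      mom6
OCN_petlist_bounds:             12 15
# ICE #
ICE_model:                      cice6
ICE_petlist_bounds:             16 19
"""


def parse_petlist_bounds(text):
    """Return {label: (model, first_pet, last_pet)} from ufs.configure text."""
    models, bounds = {}, {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"([A-Z]+)_model:\s*(\S+)", line)
        if m:
            models[m.group(1)] = m.group(2)
            continue
        m = re.match(r"([A-Z]+)_petlist_bounds:\s*(\d+)\s+(\d+)", line)
        if m:
            bounds[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return {c: (models.get(c, "unknown"), lo, hi) for c, (lo, hi) in bounds.items()}


def components_for_pets(layout, first, last):
    """List component labels whose PET range overlaps [first, last]."""
    return sorted(c for c, (_, lo, hi) in layout.items() if lo <= last and first <= hi)


layout = parse_petlist_bounds(SAMPLE)
print(components_for_pets(layout, 13, 17))
```

With a real file, one would pass the contents of run_dir/cpld_control_nowave_noaero_p8_intel/ufs.configure and the range 150 to 190. Components can share PETs, so more than one label may come back.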
ufs.configure.txt |
Thanks for the ufs.configure. It looks like the abort is from a CICE or MOM6 PET. Is there any error information in the ice_diag log? fyi, I'm "unassigning" myself from this issue and adding a request for epic support as they are responsible for platform support. I can help as needed |
@NickSzapiro-NOAA, there is no |
I cloned the UFS model today and all tests ran successfully on S4. I was able to create all baselines in rt.conf. So this issue can be closed. |
Description
The Regression Test atm_ds2s_docn_dice is failing on S4 and Jet.
It fails to create the baseline.
To Reproduce:
This is happening with Intel.
I tried on S4 and Jet and it failed on both clusters. To reproduce the bug on one of those clusters, run:
./rt.sh -c -e -n "atm_ds2s_docn_dice intel" -a "ACCOUNT_NAME"
./rt.sh -c -e -a "ACCOUNT_NAME"
Additional context
Output
output logs