Running GEOS CTM #15
Comments
I have the CTM running with this version of MAPL. You just need to add something like this into GEOSCTM.rc:
And for Dynamics:
Though, there is some duplication and it could probably be cleaned up and added to ctm_setup. |
Thank you. Very interesting. If it works, I will change the setup script... |
Do you have by chance an experiment directory on discover I can look at? My code is still crashing. Thanks. |
I do have an experiment located at: /discover/nobackup/kgerheis/experiments/ctm_test_experiment I think there might be a few other small things you need to change to get it to run. If you run into any errors I can probably quickly diagnose them as I've gone through this process several times. |
Kyle: Thank you. I will get back to you if I need more assistance. |
Kyle: It seems that I am still missing something as my code continues to crash. My working directory is: /gpfsm/dnb32/jkouatch/GEOS_CTM/GitRepos/testTR I believe that I have an rc setting issue that I cannot identify. I noticed that in my standard output file I have: In MAPL_Shmem: In MAPL_InitializeShmem (NodeRootsComm): That is not correct as there should be 4 nodes in use. |
You need to register the grid for the GridManager before you can create the grid by adding this to AdvCore_GridComp.F90:
subroutine register_grid_and_regridders()
   use MAPL_GridManagerMod, only: grid_manager
   use CubedSphereGridFactoryMod, only: CubedSphereGridFactory
   use MAPL_RegridderManagerMod, only: regridder_manager
   use MAPL_RegridderSpecMod, only: REGRID_METHOD_BILINEAR
   use LatLonToCubeRegridderMod
   use CubeToLatLonRegridderMod
   use CubeToCubeRegridderMod
   type (CubedSphereGridFactory) :: factory
   type (CubeToLatLonRegridder) :: cube_to_latlon_prototype
   type (LatLonToCubeRegridder) :: latlon_to_cube_prototype
   type (CubeToCubeRegridder) :: cube_to_cube_prototype
   call grid_manager%add_prototype('Cubed-Sphere',factory)
   associate (method => REGRID_METHOD_BILINEAR, mgr => regridder_manager)
      call mgr%add_prototype('Cubed-Sphere', 'LatLon', method, cube_to_latlon_prototype)
      call mgr%add_prototype('LatLon', 'Cubed-Sphere', method, latlon_to_cube_prototype)
      call mgr%add_prototype('Cubed-Sphere', 'Cubed-Sphere', method, cube_to_cube_prototype)
   end associate
end subroutine register_grid_and_regridders
and calling it in AdvCore SetServices (line 318):
if (.NOT. FV3_DynCoreIsRunning) then
   call fv_init2(FV_Atm, dt, grids_on_my_pe, p_split)
   call register_grid_and_regridders() ! add this line
end if
This register_grid_and_regridders routine duplicates code in DynCore and should probably be added to its own module that AdvCore and DynCore can call. |
Kyle: That helps to go further. Thanks. The code still crashes. I am now running in a debugging mode... |
Jules, keep reporting the problems you are running into since I had to go through these issues as well when getting GCHP to work with the new MAPL. I may be able to help. |
One note of caution so that you do not make the same mistake I did. Regarding the earlier comment to add this to your config file:
I made the mistake of adding the prefix (in my case My fix was to change AdvCore_GridCompMod.F90 to expect lines with the prefix rather than without.
|
Lizzie, Thank you for your comments. I am wondering if it was necessary to make any change to my GEOSCTM.rc file as the AdvCore source code has:
In any case, my code is still crashing and I am still trying to figure out why. |
Something unusual that I mentioned before is the following printout at the beginning of the code: In MAPL_Shmem: In MAPL_InitializeShmem (NodeRootsComm): I went ahead in MAPL_ShmemMod.F90 and printed the name of all the processors (nodes) after the call: All the entries of the variable "names" only had the name of the head node. Something is wrong but I do not know what. I am sure that there is a new rc setting I need to add. |
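For anyone who wants to repeat this check outside of MAPL, here is a minimal standalone sketch that prints the host name seen by each MPI rank. It is not the code in MAPL_ShmemMod.F90; the program name is made up for illustration, and it assumes an MPI library that provides the `mpi` Fortran module.

```fortran
program print_node_names
   use mpi
   implicit none
   character(len=MPI_MAX_PROCESSOR_NAME) :: name
   integer :: rank, nprocs, namelen, ierr

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Ask the MPI library which processor (node) this rank is running on.
   call MPI_Get_processor_name(name, namelen, ierr)

   ! One line per rank; compare the host names across ranks.
   print '(a,i0,a,i0,a,a)', 'rank ', rank, ' of ', nprocs, ' runs on ', name(1:namelen)

   call MPI_Finalize(ierr)
end program print_node_names
```

If it is launched the same way as GEOSctm.x and every rank reports the head node even though several nodes were requested, the problem is more likely in how the job is launched than in an rc setting.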
@JulesKouatchou This sounds suspiciously like calling MPT with ETA: For everyone, MPT does have an |
@JulesKouatchou Actually, I might need to work with you on the |
@mathomp4 That is something I had to change in ctm_run.j, and why I asked you about MPT's mpirun. I changed |
@mathomp4 |
Things are moving in the right direction. I now have as expected: In MAPL_Shmem: In MAPL_InitializeShmem (NodeRootsComm): The code is still crashing but this time in: GEOSctm.x 00000000048A9E47 fv_statemod_mp_co 3160 FV_StateMod.F90 It appears in the manipulations of U & V. My guess is that U & V are not properly read in or I am missing an rc setting somewhere. |
@JulesKouatchou Can you tell me which branch of GEOSgcm_GridComp is being used? Lines 3160 and 2902 don't seem right for this type of error. (I wanted to check on an uninitialized pointer that crops up in FV_StateMod from time to time.) |
@JulesKouatchou Also - worth learning how to include references to code-snippets in tickets. Easier to show you when you are here, but you can also find it by googling. As an example, I'll show the sections around the line numbers you mentioned above: |
@tclune I do not know how to include references to code-snippets of external components/modules of the CTM. I will try to figure it out. |
@JulesKouatchou Check this out: https://help.github.com/en/articles/creating-a-permanent-link-to-a-code-snippet. Thanks @tclune (I did not know about this). |
Note - this used to work better. Rather than a link you would actually see the lines of code in the ticket (and the email). I see that someone has raised the issue with GitHub. Started working wrongly (for some users) 2-3 weeks ago. |
Oh - it's because we are linking to text from a different repo. For the same repo it works really nicely. E.g. GEOSctm/src/Components/GEOSctm_GridComp/CTMdiffusion_GridComp/GmiDiffusionMethod_mod.F90 Lines 305 to 322 in 4b2ac7d
|
Very nice. Not sure if you guys use Slack, but it shows up nicely in Slack chats as well. |
In fv_computeMassFluxes_r8 you need to initialize uc and vc:
! add these two lines to initialize to 0
uc = 0d0
vc = 0d0
uc(is:ie,js:je,:) = ucI
vc(is:ie,js:je,:) = vcI |
Kyle, Thank you for your inputs. I was able to run the CTM code for a day. I now need to run the code longer under various configurations. |
@kgerheiser I'm wary that these are not the correct points to do initializations. Was this solution found by trial-and-error, or pulled from some other version of the code? |
I found this bug several weeks ago and tracked it down to uninitialized halo values. The line |
I looked at the r4 version of the subroutine and it also assigns the variables to 0 at the same place |
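To illustrate the uninitialized-halo problem described above, here is a small self-contained sketch. The sizes, the one-cell halo width `ng`, and the program name are assumptions made up for the example; only the names uc, ucI, is, ie, js, je and npz follow the thread.

```fortran
program halo_init_demo
   implicit none
   ! Hypothetical sizes: a 4x4 compute domain, 2 levels, 1-cell halo.
   integer, parameter :: is = 1, ie = 4, js = 1, je = 4, npz = 2, ng = 1
   real(kind=8) :: ucI(is:ie, js:je, npz)             ! interior-only input
   real(kind=8) :: uc(is-ng:ie+ng, js-ng:je+ng, npz)  ! work array with halo

   ucI = 1d0

   ! Zero the whole array first so the halo cells hold a defined value.
   uc = 0d0
   ! Then copy the interior; the halo stays at zero until something fills it.
   uc(is:ie, js:je, :) = ucI

   print *, 'interior sum =', sum(uc(is:ie, js:je, :))
   print *, 'total sum    =', sum(uc)
   print *, 'halo corner  =', uc(is-ng, js-ng, 1)
end program halo_init_demo
```

Without the `uc = 0d0` line, the halo cells would hold whatever happened to be in memory, so any later loop that reads past is:ie or js:je could behave differently between debug and optimized builds.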
It looks like someone at Harvard added this fix to the GCHP version of FV3 (r8) several years ago. Apologies if it never made it up the chain. I missed it as well when upgrading FV3 recently so will need to add it back in, although it hasn't caused a crash so far. |
I want to provide an update. The code does not run when using regular compilation options: Image PC Routine Line Source When I compile with debugging options, the code runs for 8 days and crashes: AGCM Date: 2010/02/08 Time: 16:00:00 Throughput(days/day)[Avg Tot Run]: 328.0 310.2 340.4 TimeRemaining(Est) 001:46:49 91.0% Memory Committed There seems to be a memory issue. Is there any setting I need to have? My code is at: /discover/nobackup/jkouatch/GEOS_CTM/GitRepos/GEOSctm and my experiment directory at: /discover/nobackup/jkouatch/GEOS_CTM/GitRepos/testTR |
@lizziel has also reported some memory leak issues. @bena-nasa has tried to replicate with a more synthetic use case, but was unsuccessful. I know that @wmputman has been using the advec core in his latest GCM development, but AFAIK he is using a slightly different version of FV than made it into Git. I.e., the git version of FV was really only vetted with the dycore and we are probably missing various minor fixes in the advec core that were only in CTM tags under CVS. Someone knowledgeable needs to take a hard look at the diffs between the current FV and working versions in CTM's under CVS. |
I can reproduce the crash in tp_core with debugging turned off |
I'm not sure what to make of this: I added a And it turns out |
We seem to be accumulating a sizable number of mysteries in the FV layer recently. Just yesterday Rahul made a change to the model entirely outside of FV, but it produced a runtime error in a write statement. I certainly encourage you to use aggressive debugging options under both gfortran and Intel in the hopes that it exposes something. But if you've already done that ... Valgrind? |
There is the outstanding issue that the cat fvcore_layout.rc >> input.nml in the run fails on occasion. This can produce inconsistent and undesirable effects in FV3.
-- Bill Putman
|
Did you ever try using Rusty's |
That sounds like an issue in the scripting then. No matter what, we can never not do an append because the coupled model (@yvikhlya) actually has an I could belt-and-suspender it, with scripting. We could put in detection before |
I was able to run the CTM version on GitHub that Jules pointed me to, with a few modifications outlined in the comments here. This appears to be based on Jason-3_0. I too was only able to run the debug build, but I can confirm that there does appear to be a memory leak. After 4 days at c90 with the tracer case, here is what I am seeing for the memory use on the root node of my compute session, running from 21z on the 1st to 0z on the 5th of the month: So apparently not using the little cfio does not make a difference? |
That's very encouraging on the memory front. Might need to up the priority for investigating the no-debug failure now. |
When I compile in a no-debug mode, the code crashes on Line 999 of:
It is: If I comment out the line, the code can run. I have not figured out yet why the code crashes in the subroutine "fillz" that is inside:
It seems that something is happening in the code segment (Line 87-92):
|
It crashes in fv_tracer2d? Based on your previous post, and from my testing it crashes in tp_core with a |
Here is what I am getting: MPT: #6 0x000000000173a21e in fv_fill_mod::fillz ( I read somewhere that one way to remove the "error reading variable: Cannot access memory" error is to change the compilation options. Perhaps it is why the code can run with debugging options turned on. |
Looks like it might be MPT related. I was building with Ifort and Intel MPI. Maybe that's why I don't get that bug. Right now I'm working on building with gfortran and OpenMPI in hopes that will expose some bugs. |
@JulesKouatchou Do you know what MPT environment variables are set with the CTM? It's possible one of the many we set for the GCM are needed for the CTM? |
Oh wow. I probably need the expertise of @tclune for my question with this code. So we have:
if (flagstruct%fill) call fillz(ie-is+1, npz, 1, q1(:,j,:), dp2(:,j,:))
and now the fillz routine:
subroutine fillz(im, km, nq, q, dp)
   integer, intent(in):: im !< No. of longitudes
   integer, intent(in):: km !< No. of levels
   integer, intent(in):: nq !< Total number of tracers
   real , intent(in):: dp(im,km) !< pressure thickness
   real , intent(inout) :: q(im,km,nq) !< tracer mixing ratio
Now q1 and dp2 are:
which means we are sending a weird 2d-slice of So does Fortran guarantee that |
Matt: I had the same concern. I created two temporary variables loc_q1(ie-is+1,npz,1) and loc_dp2(ie-is+1,npz). The code still crashed at the same location. |
My explanation was sort of off before. They are the same size which is the important thing. A temporary, automatic array, is created in the subroutine of the given size and the data is copied from your source array (which has the same number of elements) to the subroutine array and copied back when it returns. You're basically just interpreting the bounds differently. Though if they don't match, the code will still happily compile and run. |
@mathomp4 Unfortunately, this style is acceptable according to the standard. (But I'd much prefer not to mix F90 and F77 array styles.) As @kgerheiser explains, the compiler is forced to make copies due to the explicit shape of the dummy arguments. Without aggressive debugging flags, the sizes don't even have to agree. |
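The calling pattern under discussion can be reproduced in a few lines. This is only a sketch with made-up sizes and a hypothetical routine `scale_column` standing in for fillz: a rank-2 section of a rank-3 array is passed to an explicit-shape rank-3 dummy with nq = 1, so the element counts match.

```fortran
program explicit_shape_demo
   implicit none
   integer, parameter :: im = 3, jm = 2, km = 4
   real :: q1(im, jm, km)
   integer :: i, j, k

   do k = 1, km
      do j = 1, jm
         do i = 1, im
            q1(i, j, k) = real(100*i + 10*j + k)
         end do
      end do
   end do

   ! q1(:,2,:) is a non-contiguous im x km section. The explicit-shape
   ! dummy q(im,km,nq) has the same number of elements when nq = 1, so
   ! the elements are associated in array-element order.
   call scale_column(im, km, 1, q1(:, 2, :))

   ! Only the j = 2 plane is modified by the call.
   print *, q1(1, 1, 1), q1(1, 2, 1), q1(3, 2, 4)

contains

   subroutine scale_column(im, km, nq, q)
      integer, intent(in)    :: im, km, nq
      real,    intent(inout) :: q(im, km, nq)
      q = 2.0 * q
   end subroutine scale_column

end program explicit_shape_demo
```

Because the section is not contiguous, a compiler will typically pass a contiguous temporary and copy the result back on return. Declaring the dummy as assumed-shape, `q(:,:)`, would instead make the compiler check that the ranks agree, at the cost of changing the routine's interface.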
I want to add that when I run the stand alone DynAdvCores, I do not have an issue. My guess is that the fillz subroutine is not called. |
Runs fine in non-debug mode with Gfortran + OpenMPI. Maybe it's a compiler bug. So, we have:
ifort + Intel MPI (debug): works (with a memory leak somewhere)
ifort + Intel MPI (non-debug): divide by zero error in tp_core
ifort + MPT: some sort of memory access issue in fillz
Gfortran + OpenMPI: works |
I've found that if you remove the entries associated with I have been using TotalView and its memory debugger to catch the problem, but it yields nothing useful. I can see that memory is corrupted (at least according to TotalView), but if you add a print in the code the values are fine. And due to the problem only being present when optimization is enabled when you step through the code it jumps around, so it's hard to see what's happening. |
I have found that both crashes are during the AVX instruction |
@kgerheiser I'm slowly getting caught up now on missed stuff. Is it only one file that needs |
I cloned GEOS CTM and was able to compile it. The ctm_setup script did not properly create the experiment directory because it was still referring to the old configuration (Linux/ instead of install/). I fixed the ctm_setup file. The code is crashing during the initialization steps because it cannot create the grid. The code is failing on Line 9193 of MAPL_Generic.F90:
I can quickly understand why there is a problem: the label should only be 'GRIDNAME:'.
I checked a couple of CVS tags I have and could not locate any MAPL version similar to the one in the git repository. I am wondering if MAPL has to be updated before GEOS CTM can run.