MAPL Develop runs crashing at NAS #2515

Closed
mathomp4 opened this issue Jan 2, 2024 · 4 comments

@mathomp4
Member

mathomp4 commented Jan 2, 2024

As I come back from the break, my first task is all ready to go. All the MAPL develop (and thus MAPL3) runs are dying in my nightly tests at NAS.

Well, not all. The mom6 runs seem to run. So that provides a clue.

My guess is that this is related to the fact that we are now running with MPI_THREAD_MULTIPLE by default (via @aoloso's PR). Should be fixable... I hope.

The error appears right after the History rc files are read, and then the run crashes:


 Reading HISTORY RC Files:
 -------------------------
 NOT using buffer I/O for file: HISTORY.rc
 NOT using buffer I/O for file: geosgcm_prog.rcx
 NOT using buffer I/O for file: geosgcm_surf.rcx
 NOT using buffer I/O for file: geosgcm_ocn.rcx
 NOT using buffer I/O for file: geosgcm_moist.rcx
 NOT using buffer I/O for file: geosgcm_turb.rcx
 NOT using buffer I/O for file: geosgcm_gwd.rcx
 NOT using buffer I/O for file: geosgcm_tend.rcx
 NOT using buffer I/O for file: geosgcm_budi.rcx
 NOT using buffer I/O for file: geosgcm_buda.rcx
 NOT using buffer I/O for file: geosgcm_landice.rcx
 NOT using buffer I/O for file: geosgcm_meltwtr.rcx
 NOT using buffer I/O for file: geosgcm_snowlayer.rcx
 NOT using buffer I/O for file: geosgcm_tracer.rcx
 NOT using buffer I/O for file: tavg2d_aer_x.rcx
 NOT using buffer I/O for file: tavg3d_aer_p.rcx
 NOT using buffer I/O for file: HISTORY.rc
...
MPT ERROR: Could not register RMA window with the HCA. There may not be
	enough memory.
MPT ERROR: Assertion failed at xp.c:188: "att != (void *)-1"
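
For triaging a batch of nightly logs, the two MPT messages above can serve as a failure signature. The following is a minimal sketch, not part of MAPL or the nightly test scripts: the function names are hypothetical, and the patterns are copied only from the excerpt above, so any other MPT failure mode would go undetected.

```python
import re

# Signatures taken verbatim from the log excerpt in this issue.
# Assumption: the file name and line number in the assertion may vary,
# so they are matched loosely; everything else is matched literally.
MPT_ERROR_PATTERNS = [
    re.compile(r"MPT ERROR: Could not register RMA window with the HCA"),
    re.compile(r'MPT ERROR: Assertion failed at \S+:\d+: "att != \(void \*\)-1"'),
]


def find_mpt_rma_errors(log_text):
    """Return (line_number, line) pairs for lines matching a known signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in MPT_ERROR_PATTERNS):
            hits.append((number, line.strip()))
    return hits


def history_was_read(log_text):
    """True if the log reached the 'Reading HISTORY RC Files' banner."""
    return "Reading HISTORY RC Files:" in log_text
```

A log for which `history_was_read` is true and `find_mpt_rma_errors` is non-empty matches the pattern described here: the run gets as far as History and then dies in MPT.
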
@mathomp4 mathomp4 added the 🪲 Bugfix This fixes a bug! label Jan 2, 2024
@mathomp4
Member Author

mathomp4 commented Jan 2, 2024

What we know: MAPL 2.42.3 works, MAPL develop (as of 2024-01-02) does not.

Tests to be done:

  • Run with MAPL 2.42.4
  • Run with MAPL 2.43.0
  • Run with MAPL develop

See where the failure first occurs.
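
The plan above amounts to narrowing a regression window between a known-good and a known-bad version. Here is a minimal sketch of that bookkeeping, using only the outcomes stated in this comment (2.42.3 works, develop fails). The function name is hypothetical, and it assumes that once a version fails, every later version fails too.

```python
def narrow_regression(versions, results):
    """Narrow down where a failure was introduced.

    versions: list ordered oldest to newest.
    results:  dict mapping version -> True (works) or False (fails);
              versions missing from the dict are treated as untested.
    Returns (last_good, first_bad, untested_between).
    """
    last_good_idx = None
    first_bad_idx = None
    for i, version in enumerate(versions):
        outcome = results.get(version)
        if outcome is False:
            first_bad_idx = i
            break  # assumption: later versions are broken as well
        if outcome is True:
            last_good_idx = i

    start = 0 if last_good_idx is None else last_good_idx + 1
    stop = len(versions) if first_bad_idx is None else first_bad_idx
    untested = [v for v in versions[start:stop] if results.get(v) is None]

    last_good = None if last_good_idx is None else versions[last_good_idx]
    first_bad = None if first_bad_idx is None else versions[first_bad_idx]
    return last_good, first_bad, untested


# Versions named in this thread, oldest to newest.
candidates = ["2.42.3", "2.42.4", "2.43.0", "develop"]
# Only the two outcomes reported so far.
known = {"2.42.3": True, "develop": False}

print(narrow_regression(candidates, known))
```

Each new test result added to `known` shrinks the untested list until the last good and first bad versions are adjacent.
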

@bena-nasa
Collaborator

bena-nasa commented Jan 2, 2024

I wonder if this is a single- vs. multiple-node issue. The model (v11.4.0) with MAPL develop at c24 on 2x12 ran just fine past History, but c24 on 3x24 gave the same error as reported in the first post.

On the other hand, I ran ExtDataDriver.x on multiple nodes and that ran, so I guess it is time to figure out what in the real History RC is triggering this.
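
To make the single- vs. multiple-node reasoning concrete, here is a rough sketch of the arithmetic. It assumes the layouts are NX x NY decompositions with one MPI rank per core and no extra ranks (for example, for an I/O server). The `cores_per_node` value is a hypothetical placeholder, since the thread does not say which NAS node type was used.

```python
import math


def nodes_needed(nx, ny, cores_per_node):
    """Return (ranks, nodes) for an NX x NY layout.

    Assumes one MPI rank per core and that model ranks are the only ranks.
    """
    ranks = nx * ny
    return ranks, math.ceil(ranks / cores_per_node)


# Hypothetical node size, for illustration only.
CORES_PER_NODE = 40

for nx, ny in [(2, 12), (3, 24)]:
    ranks, nodes = nodes_needed(nx, ny, CORES_PER_NODE)
    print(f"{nx}x{ny}: {ranks} ranks on {nodes} node(s)")
```

Under that placeholder node size, the 2x12 layout fits on one node while 3x24 spans more than one, which is consistent with the hypothesis but does not confirm it.
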

@tclune
Collaborator

tclune commented Jan 3, 2024

One possibility is that there is something in the new ESMF support for SSI and that this is breaking under MPT ...

@mathomp4
Member Author

mathomp4 commented Jan 4, 2024

Well, @bena-nasa and I found a fix with MPT flags here:

GEOS-ESM/GEOSgcm_App#553

@mathomp4 mathomp4 closed this as completed Jan 4, 2024