Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Something is wrong - GEOSldas #1226

Closed
weiyuan-jiang opened this issue Dec 13, 2021 · 28 comments
Closed

Something is wrong - GEOSldas #1226

weiyuan-jiang opened this issue Dec 13, 2021 · 28 comments
Assignees
Labels
🪲 Bug Something isn't working

Comments

@weiyuan-jiang
Copy link
Contributor

With this latest components
env | (t) v3.7.0 (DH)
cmake | (t) v3.7.2 (DH)
ecbuild | (t) geos/v1.0.6 (DH)
GMAO_Shared | (t) v1.4.12 (DH)
MAPL | (b) develop (DH)
GEOSgcm_GridComp | (b) develop (DH)

GEOSldas crashes before finishing Setservice. But it can run with Debug flag..... Something is wrong but I am not sure what it is...

@weiyuan-jiang
Copy link
Contributor Author

It is this call that crashes GEOSldas .

call MAPL_WireComponent(GC, RC=STATUS)

keep looking....

@weiyuan-jiang weiyuan-jiang changed the title Something is wong Something is wrong Dec 13, 2021
@mathomp4 mathomp4 assigned mathomp4 and weiyuan-jiang and unassigned mathomp4 Dec 14, 2021
@mathomp4 mathomp4 added the 🪲 Bug Something isn't working label Dec 14, 2021
@mathomp4
Copy link
Member

I'll add in @tclune just so he knows. Never even heard of MAPL_WireComponent!

@mathomp4 mathomp4 changed the title Something is wrong Something is wrong - GEOSldas Dec 14, 2021
@mathomp4
Copy link
Member

Also adding @atrayano after seeing what's in WireComponent. That is...complex.

@tclune
Copy link
Collaborator

tclune commented Dec 14, 2021

Interesting: this is the section that I'm reverse engineering today. But had not planned to make any changes.

@tclune
Copy link
Collaborator

tclune commented Dec 14, 2021

Undoubtedly we will need to await wisdom from @atrayano , but my current understanding is that this procedure is the one that propagates imports and exports up from children. @weiyuan-jiang can you provide any more details about how it crashes?

@weiyuan-jiang
Copy link
Contributor Author

@tclune , No, I can't at this point. I spended whole day yesterday to figure that out but failed. I am keeping investigating...( Debug mode is successful)

@weiyuan-jiang
Copy link
Contributor Author

The error usually looks like
Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

Sometimes ( not all the time), it reports a little bit more details
/discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//bin//GEOSldas.x(intel_avx_rep_memcpy+0x3c3) [0x859b83]
4 /discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//lib/libMAPL.generic.so(mapl_varconnpoint_mp_get_short_name
+0xcf) [0x2aaab8af8a9f]
5 /discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//lib/libMAPL.generic.so(mapl_varconn_mp_varisconnected_ie
+0x4bb) [0x2aaab8b08d4b]
6 /discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//lib/libMAPL.generic.so(mapl_genericmod_mp_mapl_wirecomponent_+0x28f9) [0x2aaab8b3eed9]
7 /discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//lib/libMAPL.generic.so(mapl_genericmod_mp_mapl_genericsetservices_+0x1b6f) [0x2aaab8b3831f]
8 /discover/nobackup/wjiang/LDAS_exp/assim/CURRENT_nbit/build//bin//GEOSldas.x() [0x44ac0f]

And
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
GEOSldas.x 000000000077E87A Unknown Unknown Unknown
libpthread-2.22.s 00002AAAC2BCDC10 Unknown Unknown Unknown
GEOSldas.x 0000000000859B83 Unknown Unknown Unknown
libMAPL.generic.s 00002AAAB8AF8A9F mapl_varconnpoint 39 VarConnPoint.F90
libMAPL.generic.s 00002AAAB8B08D4B mapl_varconn_mp_v 191 VarConn.F90
libMAPL.generic.s 00002AAAB8B3EED9 mapl_genericmod_m 7225 MAPL_Generic.F90
libMAPL.generic.s 00002AAAB8B3831F mapl_genericmod_m 623 MAPL_Generic.F90
GEOSldas.x 000000000044AC0F geos_ldasgridcomp 289 GEOS_LdasGridComp.F90

@tclune
Copy link
Collaborator

tclune commented Dec 14, 2021

Uh oh. I did make some nontrivial changes to VarConn a few weeks back. Worked with @atrayano to be sure we covered the necessary cases and the usual tests worked. But maybe LDAS is doing something unusual? We probably need to review the connectivities there. Looking for anything out of the ordinary.

@weiyuan-jiang
Copy link
Contributor Author

weiyuan-jiang commented Dec 15, 2021

I can confirm that MAPL v2.10.0 works but v2.11.0 crashes. Can we squeeze in a tag between them for GEOSldas to fix pFIO_ShaveMantissa.c ? @mathomp4 @tclune @gmao-rreichle

@weiyuan-jiang
Copy link
Contributor Author

I am not sure how long it takes to fix our develop branch. For a quick fix for GEOSldas, a tag v2.10.1 seems easy. But I am not sure if it breaks rules and brings more complications. This is the branch off v2.10.0 https://github.com/GEOS-ESM/MAPL/tree/patch/wjiang%2Fv2.10.1 . @mathomp4 .

@mathomp4
Copy link
Member

@weiyuan-jiang Yes. We can make one. Let me make sure your branch is good and then I'll issue one. I'll probably need to push a few updates to it to make it "official"

@mathomp4
Copy link
Member

mathomp4 commented Dec 15, 2021

Of course, we'll want @tclune's blessing. But until then, can you provide me a good CHANGELOG entry for this? I can then update your branch with it

ETA: Ah. Is this the bit-shaving-works-for-no-tiles fix:

Fixed

  • Return 0 when there is no data for bit shave

@weiyuan-jiang
Copy link
Contributor Author

Is that to_import = to_export intentional? @tclune
new code

FROM_EXPORT=SRC_ID, TO_IMPORT=TO_EXPORT, _RC)

old code

TO_EXPORT=TO_EXPORT, RC=STATUS )

@mathomp4
Copy link
Member

Is that to_import = to_export intentional? @tclune new code

FROM_EXPORT=SRC_ID, TO_IMPORT=TO_EXPORT, _RC)

old code

TO_EXPORT=TO_EXPORT, RC=STATUS )

@weiyuan-jiang Is that code in ShaveMantissa?

@weiyuan-jiang
Copy link
Contributor Author

No. That has nothing to do with ShaveMantissa. I am trying to compare Tom's new commit with his old commit on oomph

@weiyuan-jiang
Copy link
Contributor Author

because the name E2E seems export to export, but somehow there is import in it.

@tclune
Copy link
Collaborator

tclune commented Dec 15, 2021

Blessing given. (To patch release)

@tclune
Copy link
Collaborator

tclune commented Dec 15, 2021

E2E is a kludge itself, and I changed the kludge a bit to simplify the underlying logic and data types. Longer term the kludge evaporates.

@tclune
Copy link
Collaborator

tclune commented Dec 15, 2021

The E2E issue is almost certainly related the issues that @weiyuan-jiang is seeing - but we're still at a loss as to why the change would break LDAS but not the GCM.

@weiyuan-jiang
Copy link
Contributor Author

weiyuan-jiang commented Dec 16, 2021

This issue is addressed by PR #1237 .
Without the fix, GEOSldas nightly test crashes with 4 nodes (28 cores each) , but succeeds with 16 nodes( 7 cores each).
With the fix, GEOSldas nightly test succeeds with 4 nodes (28 cores each) . @tclune @atrayano

@weiyuan-jiang
Copy link
Contributor Author

Also I noticed the memory is not released and leaked forever.
Without fix (16 nodes), it reports
68.8% : 74.7% Mem Comm:Used

With fix, ( only 4 nodes), it reports
41.6% : 43.6% Mem Comm:Used

@weiyuan-jiang
Copy link
Contributor Author

If the PR #1237 fixes the issue, we probably don't need a new tag v2.10.1 any more @mathomp4 @gmao-rreichle

@mathomp4
Copy link
Member

@weiyuan-jiang Well, it exists. Who knows, we might have a reason for it in the future.

Did you test #1237 with GEOSgcm?

@weiyuan-jiang
Copy link
Contributor Author

@mathomp4 I tested it with GEOSldas and am confident it will work with GEOSgcm. There were memory leak in GEOSgcm too. It was just not as severe as GEOSldas's 24-member data assimilation run.

@gmao-rreichle
Copy link

Thanks, @weiyuan-jiang, for fixing this. I'm ok with using either MAPL v2.10.1 or v2.13.xx (to be created after merging this PR).
@mathomp4, @weiyuan-jiang: Please advise on which MAPL release you prefer that we use in the upcoming GEOSldas release, and which releases for ESMA_env, ESMA_cmake, and ecbuild we should use along with the chosen MAPL release.
If there are any changes from what's currently in GEOSldas, @biljanaorescanin should create and test a GEOSldas PR to get the updated version info into the "develop" branch of GEOSldas.
@mathomp4: @biljanaorescanin could run the GCM tests if needed.

@mathomp4
Copy link
Member

mathomp4 commented Dec 16, 2021

@gmao-rreichle @biljanaorescanin For the upcoming MAPL 2.14.0, you'll want:

  • ESMA_env v3.7.0
  • ESMA_cmake v3.7.3
  • MAPL v2.14.0

Essentially the latest of all three. (Technically you don't need the latest latest, but...might as well get them. All have bugfixes.) See below.

ETA: ecbuild hasn't changed in a looooong time. So whatever is there, use that!

@mathomp4
Copy link
Member

@gmao-rreichle @biljanaorescanin NOTE: For now you'll want to use env 3.7.0 and cmake 3.7.3.

I'm planning on releasing two new non-zero-diff versions of each today as @sdrabenh wants to make an NZD release of GEOS. So don't go straight to the latest version without realizing that.

We'll probably want @biljanaorescanin or @weiyuan-jiang to do tests with the new env and cmake to make sure the LDAS doesn't go wonky with them. The changes are ESMA_env is moving to Intel 2021.3 (which is NZD to 2021.2) and ESMA_cmake changes the Intel flags to make things more portable at NAS (see GEOS-ESM/ESMA_cmake#240)

@weiyuan-jiang
Copy link
Contributor Author

sloved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪲 Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants