Seg faults in BTL atomics #1209

Closed
sjeaugey opened this issue Dec 11, 2015 · 6 comments

@sjeaugey
Member

Many tests seem to segfault with new atomics on IB. I can reproduce it with 4 ranks (2x2) but not with only 2.

Here is the backtrace from the c_accumulate test:

#0  0x00007f3e64a4e5d0 in ?? ()
#1  <signal handler called>
#2  0x00007f3e66f08cd6 in mca_btl_openib_atomic_internal (btl=0x14fc960, endpoint=0x159f190, local_address=0x7f3e6b3261b0, remote_address=140702086062144, local_handle=0x0, remote_handle=0x7f3e6b326310, 
    opcode=IBV_WR_ATOMIC_FETCH_AND_ADD, operand=-9223372036854775808, operand2=0, flags=0, order=255, cbfunc=0x7f3e64032318 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fff28f2c9e7, cbdata=0x0)
    at btl_openib_atomic.c:49
#3  0x00007f3e66f08f7d in mca_btl_openib_atomic_fop (btl=0x14fc960, endpoint=0x159f190, local_address=0x7f3e6b3261b0, remote_address=140702086062144, local_handle=0x0, remote_handle=0x7f3e6b326310, 
    op=MCA_BTL_ATOMIC_ADD, operand=9223372036854775808, flags=0, order=255, cbfunc=0x7f3e64032318 <ompi_osc_rdma_atomic_complete>, cbcontext=0x7fff28f2c9e7, cbdata=0x0) at btl_openib_atomic.c:128
#4  0x00007f3e64029bb3 in ompi_osc_rdma_lock_release_exclusive (module=0x1842f40, peer=0x15f3db0, offset=16) at osc_rdma_lock.h:294
#5  0x00007f3e6402a271 in ompi_osc_rdma_acc_put_complete (btl=0x14fc960, endpoint=0x159f190, local_address=0x7f3e6dc0c030, local_handle=0x18294a0, context=0x18b3c90, data=0x0, status=0)
    at osc_rdma_accumulate.c:109
#6  0x00007f3e66ef573b in handle_wc (device=0x14f39a0, cq=1, wc=0x7fff28f2cb60) at btl_openib_component.c:3492
#7  0x00007f3e66ef5f80 in poll_device (device=0x14f39a0, count=1) at btl_openib_component.c:3685
#8  0x00007f3e66ef63cd in progress_one_device (device=0x14f39a0) at btl_openib_component.c:3795
#9  0x00007f3e66ef6469 in btl_openib_component_progress () at btl_openib_component.c:3818
#10 0x00007f3e6cb5b4ef in opal_progress () at runtime/opal_progress.c:189
#11 0x00007f3e64025234 in ompi_osc_rdma_progress (module=0x1842f40) at osc_rdma.h:348
#12 0x00007f3e64025dc7 in ompi_osc_get_data_blocking (module=0x1842f40, endpoint=0x159f190, source_address=140702086062824, source_handle=0x7f3e6b326310, data=0x7fff28f2fd60, len=24) at osc_rdma_comm.c:77
#13 0x00007f3e640371e7 in ompi_osc_rdma_peer_setup (module=0x1842f40, peer=0x1887a20) at osc_rdma_peer.c:162
#14 0x00007f3e64037431 in ompi_osc_rdma_peer_lookup_internal (module=0x1842f40, peer_id=1) at osc_rdma_peer.c:241
#15 0x00007f3e64037651 in ompi_osc_rdma_peer_lookup (module=0x1842f40, peer_id=1) at osc_rdma_peer.c:265
#16 0x00007f3e64029021 in ompi_osc_rdma_module_peer (module=0x1842f40, peer_id=1) at osc_rdma.h:303
#17 0x00007f3e64029315 in ompi_osc_rdma_module_sync_lookup (module=0x1842f40, target=1, peer=0x7fff28f2ff98) at osc_rdma.h:447
#18 0x00007f3e6402cb52 in ompi_osc_rdma_accumulate (origin_addr=0x7fff28f300d4, origin_count=1, origin_datatype=0x602fa0, target_rank=1, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, 
    win=0x15ffcb0) at osc_rdma_accumulate.c:914
#19 0x00007f3e6d7b61ea in PMPI_Accumulate (origin_addr=0x7fff28f300d4, origin_count=1, origin_datatype=0x602fa0, target_rank=1, target_disp=0, target_count=1, target_datatype=0x602fa0, op=0x6037a0, 
    win=0x15ffcb0) at paccumulate.c:130
#20 0x00000000004010ca in main (argc=1, argv=0x7fff28f301d8) at c_accumulate.c:39
@sjeaugey sjeaugey changed the title Seg faults in BTL atomic Seg faults in BTL atomics Dec 11, 2015
@sjeaugey
Member Author

Forgot to mention: only the last rank crashes (rank 3 with np=4, or rank 5 with np=6).

@hjelmn
Member

hjelmn commented Dec 15, 2015

Looking at it now. If I can't reproduce on one of my IB machines I will try on psg.

@hjelmn
Member

hjelmn commented Dec 15, 2015

I see the problem. Should have the fix ready later today.

@hjelmn hjelmn added the bug label Dec 15, 2015
@hjelmn hjelmn added this to the v2.0.0 milestone Dec 15, 2015
hjelmn added a commit to hjelmn/ompi-release that referenced this issue Dec 15, 2015
A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.

Fixes open-mpi/ompi#1209

(cherry picked from open-mpi/ompi@0de9445)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn referenced this issue Dec 15, 2015
A previous commit updated the one-sided code to register the state
region only once. This created an issue when using the scratch lock
with fetching atomics. In this case on any rank that isn't local rank
0 the module->state_handle is NULL. This commit fixes the issue by
removing the scratch lock and using a fragment pointer instead.

Fixes #1290

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@sjeaugey
Member Author

👍

@hppritcha
Member

@sjeaugey can this issue be closed?

@sjeaugey
Member Author

sjeaugey commented Feb 5, 2016

Yes. I thought it was already closed, but only 1241 was. Sorry about that.

@sjeaugey sjeaugey closed this as completed Feb 5, 2016
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 19, 2016
While we weren't really hanging, it was taking a very, very long time…