sockets provider occasionally hangs #701

Closed
shefty opened this issue Feb 26, 2015 · 9 comments

@shefty
Member

shefty commented Feb 26, 2015

The sockets provider hangs when running fi_msg_pingpong. The hang occurs on the client side and is frequent, but does not happen on every run. Sometimes the client pauses for several seconds before completing. At other times it hangs long enough to look up the PID and attach a debugger. This was the backtrace from one such hang:
#0  0x00000030b56d7e33 in poll () from /lib64/libc.so.6
#1  0x00007f0e9460041d in fi_poll_fd (fd=<value optimized out>,
    timeout=<value optimized out>) at src/common.c:98
#2  0x00007f0e9460405a in rbfdwait (cq=0x889880, buf=0x7fff1d772350, count=1,
    src_addr=0x0, cond=<value optimized out>, timeout=<value optimized out>)
    at ./include/fi_rbuf.h:282
#3  sock_cq_sreadfrom (cq=0x889880, buf=0x7fff1d772350, count=1, src_addr=0x0,
    cond=<value optimized out>, timeout=<value optimized out>)
    at prov/sockets/src/sock_cq.c:284
#4  0x00007f0e946044af in sock_cq_readfrom (cq=<value optimized out>,
    buf=<value optimized out>, count=<value optimized out>,
    src_addr=<value optimized out>) at prov/sockets/src/sock_cq.c:305
#5  0x0000000000401216 in fi_cq_read (size=<value optimized out>)
    at /usr/local/include/rdma/fi_eq.h:366
#6  recv_xfer (size=<value optimized out>) at simple/msg_pingpong.c:96
#7  0x00000000004014bc in run_test () at simple/msg_pingpong.c:147
#8  0x0000000000401841 in run (argc=<value optimized out>,
    argv=<value optimized out>) at simple/msg_pingpong.c:516
#9  main (argc=<value optimized out>, argv=<value optimized out>)
    at simple/msg_pingpong.c:572

For this test, the server side always seems to finish quickly.

@shefty
Member Author

shefty commented Feb 26, 2015

@shantonu @jithinjosepkl

@shefty
Member Author

shefty commented Feb 26, 2015

Tracing through the code, my guess is that the last write from the server gets lost, leaving the client spinning forever looking for a recv that never shows up.
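To make the symptom concrete, here is a minimal sketch of a wait that reports a timeout instead of blocking forever when the expected data never arrives. It uses a plain `socketpair()` and `poll()` rather than libfabric. `wait_readable` and `demo_wait_timeout` are hypothetical names introduced for illustration, and the 50 ms timeout is an arbitrary assumption.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of libfabric.
 * Returns 1 if fd became readable, 0 on timeout, -1 on error.
 * Note: a retry after EINTR restarts the full timeout. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ret;

    assert(fd >= 0);

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;               /* nothing arrived within the timeout */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Hypothetical demo: returns 0 if the wait times out while no data is
 * queued and succeeds once the peer has written something. */
int demo_wait_timeout(void)
{
    int sv[2];
    int before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* The peer has sent nothing: this models the "lost" final message. */
    before = wait_readable(sv[0], 50);

    if (write(sv[1], "x", 1) != 1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Now data is queued, so the same wait returns immediately. */
    after = wait_readable(sv[0], 50);

    close(sv[0]);
    close(sv[1]);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

A bounded wait like this does not fix a lost message, but it turns a silent hang into a timeout that a test can report.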

@shefty
Member Author

shefty commented Feb 27, 2015

fi_cmatose hangs as well, likely from the same cause.

@shantonu
Contributor

What parameters were you using to run msg_pingpong? I tried with the default parameters a couple of times but didn't see this behavior. Does it happen with fi_cmatose every time?

@shefty
Member Author

shefty commented Feb 27, 2015

Server: fi_msg_pingpong
Client: fi_msg_pingpong 127.0.0.1

@shefty
Member Author

shefty commented Mar 4, 2015

I believe that I've tracked this issue down. It's easier to hit this issue with larger transfers, but the problem exists for all transfer sizes, and likely applies to other types of data transfers.

Example problem: The app issues a send and waits for it to complete. On reading the completion, the app exits. The issue is that the completion merely indicates that the send was queued locally, either in an underlying socket or in an intermediate buffer. When the app exits, any queued data may be lost. On the remote side, the expected data never arrives, which results in a hang.
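One possible application-level way to avoid exiting too early is to wait for an explicit acknowledgement from the peer before treating the transfer as done. The following is a minimal sketch of that idea using plain POSIX sockets, not libfabric, and it is not necessarily how the tests were changed. All function names and the 1-byte `'A'` ack are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; returns 0 on success, -1 on error or early EOF. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Sender side (hypothetical): a successful write only means the data was
 * queued locally, so block until the peer's ack arrives before returning. */
int send_then_wait_for_ack(int fd, const void *buf, size_t len)
{
    char ack = 0;

    assert(buf != NULL);
    if (write_all(fd, buf, len) != 0)
        return -1;
    if (read_all(fd, &ack, 1) != 0)
        return -1;
    return ack == 'A' ? 0 : -1;
}

/* Receiver side (hypothetical): consume the full payload, then ack it. */
int recv_then_ack(int fd, void *buf, size_t len)
{
    const char ack = 'A';

    if (read_all(fd, buf, len) != 0)
        return -1;
    return write_all(fd, &ack, 1);
}

/* Hypothetical demo: the parent sends, a forked child receives.
 * Returns 0 only if both sides completed the handshake. */
int demo_ack_handshake(void)
{
    int sv[2], status, rc;
    char payload[4096] = {0};
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char buf[sizeof(payload)];
        close(sv[0]);
        _exit(recv_then_ack(sv[1], buf, sizeof(buf)) == 0 ? 0 : 1);
    }

    close(sv[1]);
    rc = send_then_wait_for_ack(sv[0], payload, sizeof(payload));
    close(sv[0]);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return rc;
}
```

The cost is one extra round trip at the end of the exchange. In return, the sender knows the receiver actually consumed the data before the sending process exits.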

@shantonu
Contributor

shantonu commented Mar 5, 2015

I see! In that case our test example may not be perfect as it assumes the data is transferred properly.

@sayantansur

Should the test be setting FI_REMOTE_COMPLETE?

@shefty
Member Author

shefty commented Mar 5, 2015

Remote complete is not supported by the verbs provider.

@shefty shefty added this to the release 1.0 milestone Mar 18, 2015
@shefty shefty closed this as completed Mar 23, 2015
shefty added a commit to shefty/libfabric that referenced this issue Aug 31, 2018
multi_ep: Remove testing of udp provider