sockets provider occasionally hangs #701
Comments
Tracing through the code, my guess is that the last write from the server gets lost, leaving the client spinning forever looking for a recv that never shows up.
The code will hang in fi_cmatose as well, likely from the same cause.
What parameters were you using to run msg_pingpong? I tried the default parameters a couple of times but didn't see this behavior. Does it happen with fi_cmatose every time?
Server: fi_msg_pingpong
I believe that I've tracked this issue down. It's easier to hit with larger transfers, but the problem exists for all transfer sizes, and likely applies to other types of data transfers. Example problem: the app issues a send and waits for it to complete. On reading the completion, the app exits. The issue is that the completion merely indicates that the send was queued locally, either in an underlying socket or in an intermediate buffer. When the app exits, any queued data may be lost. On the remote side, the expected data never arrives, which results in a hang.
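To illustrate the failure mode described above, here is a minimal sketch using plain POSIX sockets rather than the libfabric API. It assumes the cure is an application-level acknowledgement: the sender does not treat a successful write as delivery, and instead waits for a one-byte ack from the receiver before it closes and exits. The names `send_with_ack_demo`, `read_full`, `write_full`, and `MSG_SIZE` are hypothetical and are not part of the tests or the sockets provider.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MSG_SIZE 4096 /* hypothetical transfer size */

/* Read exactly len bytes; returns 0 on success, -1 on error or early EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes; returns 0 on success, -1 on error. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical demo: returns 0 if the receiver confirmed the payload. */
int send_with_ack_demo(void)
{
    int sv[2];
    char payload[MSG_SIZE];
    char ack = 0;
    int rc, status = 0;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        /* Receiver: consume the whole payload, then acknowledge it. */
        char buf[MSG_SIZE];

        close(sv[0]);
        rc = read_full(sv[1], buf, sizeof buf);
        if (rc == 0)
            rc = write_full(sv[1], "A", 1);
        _exit(rc == 0 ? 0 : 1);
    }

    /* Sender: a successful write only means the bytes were queued locally. */
    close(sv[1]);
    memset(payload, 0x5a, sizeof payload);
    rc = write_full(sv[0], payload, sizeof payload);

    /* Wait for the receiver's ack before closing, instead of exiting now. */
    if (rc == 0)
        rc = read_full(sv[0], &ack, 1);
    close(sv[0]);

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (rc != 0 || ack != 'A')
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling `send_with_ack_demo()` returns 0 only after the receiver has confirmed the full payload. The extra round trip costs latency, which is the usual trade-off against exiting as soon as the local completion is read.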
I see! In that case our test example may not be perfect, as it assumes the data is transferred properly.
Should the test be setting FI_REMOTE_COMPLETE?
Remote complete is not supported by the verbs provider.
multi_ep: Remove testing of udp provider
The sockets provider will hang running fi_msg_pingpong. The hang occurs on the client side and is frequent, but does not happen on every run. Sometimes the client pauses for several seconds before completing. At other times it has hung long enough to look up the PID and attach a debugger to it. This was the backtrace from one such hang:
#0 0x00000030b56d7e33 in poll () from /lib64/libc.so.6
#1 0x00007f0e9460041d in fi_poll_fd (fd=,
#2 0x00007f0e9460405a in rbfdwait (cq=0x889880, buf=0x7fff1d772350, count=1,
#3 sock_cq_sreadfrom (cq=0x889880, buf=0x7fff1d772350, count=1, src_addr=0x0,
#4 0x00007f0e946044af in sock_cq_readfrom (cq=,
#5 0x0000000000401216 in fi_cq_read (size=)
#6 recv_xfer (size=) at simple/msg_pingpong.c:96
#7 0x00000000004014bc in run_test () at simple/msg_pingpong.c:147
#8 0x0000000000401841 in run (argc=,
#9 main (argc=, argv=)
For this test, the server side always seems to finish quickly.