IME broken by "prov/tcp: Store a reference to the active transmit with the ep" #7443
All IME rpc threads are polling and not getting the information that there is work to be done:
#0 0x00007ffff63360f7 in epoll_wait () from /lib64/libc.so.6
I guess it is locking related; switching to a single thread 'fixes' the issue.
I'm not familiar with IME. What version of libfabric did you use? Can you describe how IME uses libfabric? What type of endpoints? Which data transfer calls (msg, rma, tagged)?

In my testing MSG_ZEROCOPY didn't show any benefit. That was with MPI, but I also didn't spend any time trying to tune zero copy or figure out why there was no performance gain.

The commit you referenced is more of a code simplification patch than anything functional. There's a list of pending transmit operations. Prior to the commit, the provider would access the operation at the head of the transmit queue when sending data. The commit removes the entry at the head of the queue, tracks it using a pointer instead, and sends the data that way. One of the benefits is that all operations queued on the transmit list stay exactly as submitted, while the active operation may have had its headers byte-swapped.
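To make that before/after concrete, here is a rough sketch of the pattern described above, using generic list types rather than the provider's real structures (all names here - ep_state, cur_tx, xfer_entry - are illustrative assumptions, not libfabric code):

```c
#include <stddef.h>

/* Illustrative sketch only, not libfabric source. */
struct xfer_entry {
    struct xfer_entry *next;
    /* headers, iov, byte counts, ... */
};

struct ep_state {
    struct xfer_entry *tx_head;   /* queue of pending transmits */
    struct xfer_entry *cur_tx;    /* after the commit: the active transmit */
};

/* Before: the provider sent from the entry sitting at the head of the queue,
 * leaving it on the list while it was being (partially) transmitted. */
struct xfer_entry *active_tx_before(struct ep_state *ep)
{
    return ep->tx_head;
}

/* After: the head entry is removed and tracked through a pointer, so every
 * entry still on the queue is exactly as submitted; only cur_tx may have had
 * its headers byte-swapped for the wire. */
void start_next_tx_after(struct ep_state *ep)
{
    if (!ep->cur_tx && ep->tx_head) {
        ep->cur_tx = ep->tx_head;
        ep->tx_head = ep->tx_head->next;
        ep->cur_tx->next = NULL;
    }
}
```

Only the shape of the change matters here: the active entry is detached from the queue rather than being operated on in place.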
Can you also please confirm that the performance results are consistent for the commits referenced? There shouldn't have been any impact on performance or locking with them.
Let's ignore MSG_ZEROCOPY for now; I just hope that it will reduce memory bandwidth consumption, which is a limiting factor on an upcoming system.
What version did you try switching to before bisecting? |
In that order: master, 1.14, 1.13 (all three bad), then 1.12 (good) - and then bisected between 1.12 and 1.13.
As a side note, the hanging part shows up with lots of ep polling threads, which we have needed so far for good performance. I was trying to track down which line causes the issue by partially reverting your changes - still ongoing, no success yet. Reads from server to client: ==> Thanks for your work on MSG_ZEROCOPY!
To be clear, you're seeing the performance drop with zerocopy disabled, correct? I've been going through that commit line by line as well and still not seeing anything that should cause an issue. Glad that zero copy is showing improvement for at least 1 app!
One difference that I've found is that the new code replaces calls to tcpx_tx_pending() with ep->cur_tx_entry. However, that's not a direct replacement: ep->cur_tx_entry is a suitable replacement for the slist_empty check, but not the bounce buffer check.
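To illustrate the distinction, here is a hedged sketch of the two kinds of check (the field names, in particular stage_bytes for the bounce buffer, are assumptions for illustration and not the provider's actual members):

```c
#include <stdbool.h>
#include <stddef.h>

struct xfer_entry;                    /* opaque here; only pointers are used */

struct tx_state {
    struct xfer_entry *cur_tx_entry;  /* active transmit, if any */
    struct xfer_entry *tx_queue;      /* head of queued transmits */
    size_t stage_bytes;               /* bytes staged in a bounce buffer */
};

/* Old-style "is anything pending?" check: covers the queue AND staged data. */
static bool tx_pending_full(struct tx_state *tx)
{
    return tx->cur_tx_entry || tx->tx_queue || tx->stage_bytes;
}

/* Checking only the active-entry pointer replaces the slist_empty test, but
 * it reports "idle" even while bounce-buffer bytes still need to be flushed. */
static bool tx_pending_pointer_only(struct tx_state *tx)
{
    return tx->cur_tx_entry != NULL;
}
```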
Relative to 45dce66 I think the changes that were made to tcpx_try_func() need to be reverted, and that tcpx_tx_pending() also needs to be updated. However, I see that tcpx_tx_pending() is correct in the latest main, and tcpx_try_func() was also modified in main. So my thoughts on what's off with this commit don't carry over to the latest main.
Can you please see if shefty@60d0e82 helps any? It looks like there's a bug in the tree at the commit point. But as I mentioned above, if this shows any benefit, I'm still at a loss as to why main didn't work.
Yes, with zerocopy disabled -> 25 GB/s. That can be tuned up to 32 GB/s with more threads and other PCI tunings, but I'm sure I was then running into memory limitations - adding more threads kept the corresponding cores busy (one core per thread) without adding more performance, with all of them busy in kernel copy functions.
I guess you need to have a system and setup where memory bandwidth is the limiting factor.
I had actually already tried several things; as I wrote, I was kind of reverting things line by line, but I didn't get it working so far.

This might be the problem:
Old code -- if the tx_queue is not empty, i.e. there are sends that are queued, signal the wait. The check is negated. I'm looking into the purpose of this wait signaling, but this section of code is the same in main, so if this is the problem it carries through to the later releases.
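A schematic of that inversion, with illustrative names (this is not the provider's actual source, just the shape of the condition being discussed):

```c
#include <stddef.h>

struct xfer_entry;
struct wait_obj;

/* Stand-in for the provider's wait signalling (e.g. writing a wake-up fd). */
static void wait_signal(struct wait_obj *wait)
{
    (void) wait;
}

struct sketch_ep {
    struct xfer_entry *tx_queue;     /* pre-commit: queued sends */
    struct xfer_entry *cur_tx_entry; /* post-commit: the active send */
    struct wait_obj *wait;
};

/* Old behaviour: sends are queued -> signal the wait so the poller notices. */
static void maybe_signal_old(struct sketch_ep *ep)
{
    if (ep->tx_queue != NULL)        /* i.e. !slist_empty(&tx_queue) */
        wait_signal(ep->wait);
}

/* Negated replacement: fires only when there is NO active transmit, so the
 * waiter is never woken while data is actually queued. */
static void maybe_signal_negated(struct sketch_ep *ep)
{
    if (!ep->cur_tx_entry)
        wait_signal(ep->wait);
}
```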
wait->signal was added by a4d80a0 |
Indeed that was it, thanks a lot! I also totally missed the negation. In fact, I had tried to comment out that code section, but it was still hanging (with a perf improvement, though) and then didn't look further...
What I think is happening is this: a thread waits for completions on the CQ. At the time that the thread blocks, the socket has no outgoing transfers, so the thread calls select/poll looking for a POLLIN event.

Another thread tries to send a large transfer. A portion of the data is accepted by the socket; the rest of the transfer is queued on the ep, and that thread returns. Additional messages may be posted to the ep, but since there's already an active transfer, those messages simply get queued on the tx_queue.

Note that the thread that was blocked on the CQ is still blocked waiting for a POLLIN event. However, it should be waiting for POLLOUT | POLLIN. The signal in this case isn't trying to drive progress, as the commit message suggested, but is needed to update the poll call. Thanks for confirming. I'll prepare a patch for upstream.
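As a standalone illustration of that scenario (plain epoll plus a wake-up pipe; none of these names come from libfabric), the blocked poller has to be kicked so it can re-arm the socket with POLLOUT as well as POLLIN:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

struct sketch_ep {
    int sock;          /* connected TCP socket */
    int signal_wr;     /* write end of a wake-up pipe watched by the poller */
    bool tx_pending;   /* unsent bytes queued on the endpoint */
};

/* Sender thread: a partial send left data queued; kick the blocked poller. */
void on_partial_send(struct sketch_ep *ep)
{
    uint8_t b = 0;
    ep->tx_pending = true;
    (void) write(ep->signal_wr, &b, 1);    /* the wait->signal step */
}

/* Poller thread: recompute the event mask every time it is woken. */
void poll_once(struct sketch_ep *ep, int epfd)
{
    struct epoll_event ev, out[2];

    ev.events = EPOLLIN | (ep->tx_pending ? EPOLLOUT : 0);
    ev.data.fd = ep->sock;
    epoll_ctl(epfd, EPOLL_CTL_MOD, ep->sock, &ev);

    /* Without the signal above, this call would keep sleeping for POLLIN
     * only, even though queued tx data needs a POLLOUT wake-up to progress. */
    (void) epoll_wait(epfd, out, 2, -1);
}
```

In the real provider the wake-up mechanism and event re-arming are internal; the sketch only shows why a missing signal leaves the poller stuck in a read-only wait.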
I have pull requests for main, v1.14.x, and v1.13.x now open. |
We didn't update libfabric for quite some time, but now have a system where memory bandwidth is an issue and I have some hope that MSG_ZEROCOPY might help. Using recent versions we hit an issue with prov/tcp, and bisecting leads to commit 45dce66.
A simple IOR test reveals the following (presumably IOR's standard summary columns: access, bw(MiB/s), IOPS, latency(s), block(KiB), xfer(KiB), open(s), wr/rd(s), close(s), total(s), iteration):
commit cdd00d3
write 24419 3061.97 0.027907 10158080 8192 0.001211 35.64 0.111349 35.75 0
read 28029 3504.63 0.024385 10158080 8192 0.000610 31.14 0.008115 31.14 0
commit 45dce66
write 23091 3080.07 0.022109 1015808 8192 0.001081 3.54 0.236534 3.78 0
[0010] 2022-02-15 06:26:24 [ warn] ( ime_inode_imcl_rdbuf_flush_dirty_cb:00634) Some read requests aren't served by IME servers for long time (71 sec)
...
[0079] 2022-02-15 06:29:39 [ warn] ( ime_inode_imcl_rdbuf_flush_dirty_cb:00634) Some read requests aren't served by IME servers for long time (73 sec)
read 287.04 56.56 1.11 1015808 8192 0.000504 192.91 111.21 304.12 0
So writes from clients to servers work normally, but reads are very slow. It looks like there is a spike of IO from time to time and then a long pause in between.
Now checking what this patch actually does...