
IME broken by "prov/tcp: Store a reference to the active transmit with the ep" #7443

Closed
bsbernd opened this issue Feb 15, 2022 · 19 comments

bsbernd commented Feb 15, 2022

We hadn't updated libfabric for quite some time, but now have a system where memory bandwidth is an issue, and I have some hope that MSG_ZEROCOPY might help. Recent versions have an issue with prov/tcp, and bisecting leads to commit 45dce66

A simple IOR test reveals

commit cdd00d3
write 24419 3061.97 0.027907 10158080 8192 0.001211 35.64 0.111349 35.75 0
read 28029 3504.63 0.024385 10158080 8192 0.000610 31.14 0.008115 31.14 0

commit 45dce66
write 23091 3080.07 0.022109 1015808 8192 0.001081 3.54 0.236534 3.78 0
[0010] 2022-02-15 06:26:24 [ warn] ( ime_inode_imcl_rdbuf_flush_dirty_cb:00634) Some read requests aren't served by IME servers for long time (71 sec)
...

[0079] 2022-02-15 06:29:39 [ warn] ( ime_inode_imcl_rdbuf_flush_dirty_cb:00634) Some read requests aren't served by IME servers for long time (73 sec)
read 287.04 56.56 1.11 1015808 8192 0.000504 192.91 111.21 304.12 0

So writes from clients to servers work normally, but reads are very slow. It looks like there is a spike of IO from time to time, with long pauses in between.

Now checking what this patch actually does...

bsbernd commented Feb 15, 2022

All IME RPC threads are polling and are not getting notified that there is work to be done:

#0 0x00007ffff63360f7 in epoll_wait () from /lib64/libc.so.6
#1 0x00007ffe877ca0e8 in im_net_ofi_poll_blocking () from /opt/ddn/ime/lib/libim_net_ofi.so
#2 0x00007ffe877ca3e0 in im_net_ofi_poll_numa () from /opt/ddn/ime/lib/libim_net_ofi.so
#3 0x0000000000526940 in im_rpc_sched_thread ()
#4 0x00007ffff7b086a7 in im_thread_create_affinized_wrapper () from /opt/ddn/ime/lib/libim_common.so.0
#5 0x00007ffff680e17a in start_thread () from /lib64/libpthread.so.0
#6 0x00007ffff6335dc3 in clone () from /lib64/libc.so.6

bsbernd commented Feb 15, 2022

I guess it's locking related; switching to a single thread 'fixes' the issue.

shefty commented Feb 15, 2022

I'm not familiar with IME. What version of libfabric did you use? Can you describe how IME uses libfabric? What type of endpoints? Which data transfer calls (msg, rma, tagged)?

In my testing, MSG_ZEROCOPY didn't show any benefit. That was with MPI, but I also didn't spend any time trying to tune zero copy or figure out why there was no performance gain.

The commit you referenced is more of a code simplification patch than anything functional. There's a list of pending transmit operations. Prior to the commit, the provider would access the operation at the head of the transmit queue when sending data. The commit removes the entry at the head of the queue, tracks it using a pointer instead, and sends the data that way. One of the benefits is that all operations queued on the transmit list remain exactly as submitted, while the active operation may have had its headers byte-swapped.
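Roughly, the pattern changed like this (a sketch using the names from the snippets later in this thread; the real code differs in detail):

/* Before 45dce66 (sketch): the active operation stayed at the head of
 * tx_queue, and the provider peeked at it each time it sent data. */
tx_entry = container_of(ep->tx_queue.head, struct tcpx_xfer_entry, entry);
/* ... send from tx_entry; remove it from tx_queue only when complete ... */

/* After 45dce66 (sketch): the active operation is dequeued up front and
 * tracked through a dedicated pointer, so everything still sitting on
 * tx_queue remains exactly as submitted. */
if (!ep->cur_tx_entry && !slist_empty(&ep->tx_queue))
	ep->cur_tx_entry = container_of(slist_remove_head(&ep->tx_queue),
					struct tcpx_xfer_entry, entry);
/* ... send from ep->cur_tx_entry; set it to NULL on completion ... */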

shefty commented Feb 15, 2022

Can you also please confirm that the performance results are consistent for the commits referenced? There shouldn't have been any impact on performance or locking with them.

bsbernd commented Feb 15, 2022

Let's ignore MSG_ZEROCOPY for now; I just hope that it will reduce memory bandwidth consumption, which is a limiting factor on an upcoming system.
IME is just a client-server file system implementation, well, mostly a burst buffer. Right now we are at 1.11-patched, and that commit is in the way of updating. It is absolutely consistent: up to commit cdd00d3 there is no issue, and commit 45dce66 breaks reads from server to client, but only if there are multiple polling threads working on the same EP. I can only guess that MPI testing (and also the testing done by @sydidelot for red) uses only a single polling thread per EP, so the issue was not noticed until now.

shefty commented Feb 15, 2022

What version did you try switching to before bisecting?

bsbernd commented Feb 15, 2022

In order: master, 1.14, 1.13 (all three bad), and then 1.12 (good); I bisected between 1.12 and 1.13.

bsbernd commented Feb 16, 2022

As a side note, the hanging part occurs with lots of EP polling threads, which we have needed so far for good performance. I have been trying to track down which line causes the issue by partially reverting your changes; still ongoing, no success yet.
But just for fun I tried with 2 threads and MSG_ZEROCOPY still enabled:

reads from server to client:
enabled: 42380MB/s
disabled: 25447MB/s

==> Thanks for your work on MSG_ZEROCOPY!

shefty commented Feb 16, 2022

To be clear, you're seeing the performance drop with zerocopy disabled, correct?

I've been going through that commit line by line as well and still not seeing anything that should cause an issue. Glad that zero copy is showing improvement for at least 1 app!

shefty commented Feb 16, 2022

One difference that I've found is that the new code replaces calls to tcpx_tx_pending() with ep->cur_tx_entry. However, that's not a direct replacement.

static inline bool tcpx_tx_pending(struct tcpx_ep *ep)
{
	return !slist_empty(&ep->tx_queue) || ofi_bsock_tosend(&ep->bsock);
}

ep->cur_tx_entry is a suitable replacement for the slist_empty check, but not the bounce buffer check.
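Something like this would cover both conditions (a sketch only, not necessarily the exact code in main):

static inline bool tcpx_tx_pending(struct tcpx_ep *ep)
{
	/* active transmit tracked by the new pointer, plus any data
	 * still sitting in the bounce buffer */
	return ep->cur_tx_entry || ofi_bsock_tosend(&ep->bsock);
}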

shefty commented Feb 16, 2022

Relative to 45dce66, I think the changes made to tcpx_try_func() need to be reverted, and tcpx_tx_pending() also needs to be updated. However, I see that tcpx_tx_pending() is correct in the latest main, and tcpx_try_func() was also modified in main. So my thoughts on what's off with this commit don't carry over to the latest main.

shefty commented Feb 16, 2022

Can you please see if shefty@60d0e82 helps any? It looks like there's a bug in the tree at the commit point. But as I mentioned above, if this shows any benefit, I'm still at a loss as to why main didn't work.

bsbernd commented Feb 16, 2022

> To be clear, you're seeing the performance drop with zerocopy disabled, correct?

Yes. With zerocopy disabled -> 25GB/s; that can be tuned up to 32GB/s with more threads and other PCI tunings, but I'm sure I was then running into memory limitations: adding more threads kept the corresponding cores busy (one core per thread) without adding more performance, with all of them busy in kernel copy functions.

> I've been going through that commit line by line as well and still not seeing anything that should cause an issue. Glad that zero copy is showing improvement for at least 1 app!

I guess you need to have a system and setup where memory bandwidth is the limiting factor.

bsbernd commented Feb 16, 2022

> One difference that I've found is that the new code replaces calls to tcpx_tx_pending() with ep->cur_tx_entry. However, that's not a direct replacement.
>
> static inline bool tcpx_tx_pending(struct tcpx_ep *ep)
> {
> 	return !slist_empty(&ep->tx_queue) || ofi_bsock_tosend(&ep->bsock);
> }
>
> ep->cur_tx_entry is a suitable replacement for the slist_empty check, but not the bounce buffer check.

I had actually already tried several things; as I wrote, I was reverting things line by line, but I didn't get it working so far:

diff --git a/prov/tcp/src/tcpx.h b/prov/tcp/src/tcpx.h
index 95ba7f9e6..a272e91e5 100644
--- a/prov/tcp/src/tcpx.h
+++ b/prov/tcp/src/tcpx.h
@@ -354,7 +354,7 @@ void tcpx_tx_queue_insert(struct tcpx_ep *tcpx_ep,
 
 static inline bool tcpx_tx_pending(struct tcpx_ep *ep)
 {
-       return !slist_empty(&ep->tx_queue) || ofi_bsock_tosend(&ep->bsock);
+       return ep->cur_tx_entry || ofi_bsock_tosend(&ep->bsock);
 }

diff --git a/prov/tcp/src/tcpx_progress.c b/prov/tcp/src/tcpx_progress.c
index 38fd5b0ee..9d53ebbec 100644
--- a/prov/tcp/src/tcpx_progress.c
+++ b/prov/tcp/src/tcpx_progress.c
@@ -51,6 +51,7 @@ void tcpx_progress_tx(struct tcpx_ep *ep)
 
        assert(fastlock_held(&ep->lock));
        if (!ep->cur_tx_entry) {
+               assert(slist_empty(&ep->tx_queue));
                (void) ofi_bsock_flush(&ep->bsock);
                return;
        }
@@ -81,6 +82,7 @@ void tcpx_progress_tx(struct tcpx_ep *ep)
        if (!slist_empty(&ep->tx_queue)) {
                ep->cur_tx_entry = container_of(slist_remove_head(&ep->tx_queue),
                                                struct tcpx_xfer_entry, entry);
+               assert(ep->cur_tx_entry != NULL);
        } else {
                ep->cur_tx_entry = NULL;
        }
@@ -687,12 +689,13 @@ int tcpx_try_func(void *util_ep)
                               struct util_wait_fd, util_wait);
 
        fastlock_acquire(&ep->lock);
-       if (ep->cur_tx_entry && !ep->pollout_set) {
+
+       if (tcpx_tx_pending(ep) && !ep->pollout_set) {
                ep->pollout_set = true;
                events = (wait_fd->util_wait.wait_obj == FI_WAIT_FD) ?
                         (OFI_EPOLL_IN | OFI_EPOLL_OUT) : (POLLIN | POLLOUT);
                goto epoll_mod;
-       } else if (!ep->cur_tx_entry && ep->pollout_set) {
+       } else if (!tcpx_tx_pending(ep) && ep->pollout_set) {
                ep->pollout_set = false;
                events = (wait_fd->util_wait.wait_obj == FI_WAIT_FD) ?
                         OFI_EPOLL_IN : POLLIN;
[...]

shefty commented Feb 16, 2022

This might be the problem:

-		if (!slist_empty(&tcpx_ep->tx_queue) && wait)
+		if (!ep->cur_tx_entry && wait)
			wait->signal(wait);

Old code -- if the tx_queue is not empty, i.e. there are sends that are queued, signal the wait.
New code -- if there's no current entry, i.e. there are no sends queued, signal the wait.

The check is negated.

I'm looking into the purpose of this wait signaling, but this section of code is the same in main, so if this is the problem it carries through the later releases.
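If so, the presumed fix is simply to flip the condition so the wait is signaled when a transmit actually is queued (a sketch, not the upstream patch):

		if (ep->cur_tx_entry && wait)
			wait->signal(wait);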

shefty commented Feb 16, 2022

wait->signal was added by a4d80a0 in order to drive progress if there were threads waiting on the CQ. MPI and similar apps busy poll on the CQ and don't block, so they would not be impacted by this. If IME does block, it could explain the hangs.
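For illustration, the difference between the two usage patterns (assumed generic libfabric usage, not IME's actual code):

/* Busy polling: spins on the CQ and never sleeps in epoll_wait(),
 * so it never depends on wait->signal() to be woken up. */
struct fi_cq_entry comp;
ssize_t ret;
do {
	ret = fi_cq_read(cq, &comp, 1);
} while (ret == -FI_EAGAIN);

/* Blocking read: sleeps on the CQ's wait object; if the underlying
 * fd is armed for the wrong events, this is where a thread hangs. */
ret = fi_cq_sread(cq, &comp, 1, NULL, -1);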

bsbernd commented Feb 16, 2022

Indeed, that was it; thanks a lot! I also totally missed the negation. In fact, I had tried commenting out that code section, but it was still hanging (with a perf improvement, though) and then didn't look further...

shefty commented Feb 16, 2022

What I think is happening is this:

A thread waits for completions on the CQ. At the time that the thread blocks, the socket has no outgoing transfers, so the thread calls select/poll looking for a POLLIN event.

Another thread tries to send a large transfer. A portion of the data is accepted by the socket. The rest of the transfer is queued on the ep. That thread returns. Additional messages may be posted to the ep, but since there's already an active transfer, those messages simply get queued on the tx_queue.

Note that the thread that was blocked on the CQ is still blocked waiting for a POLLIN event. However, it should be waiting for POLLOUT | POLLIN. The signal in this case isn't trying to drive progress, as the commit message suggested, but is needed to update the poll call.
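That is, the signal matters because it bounces the blocked thread back through tcpx_try_func(), which can then re-arm the wait fd for writability, roughly as in the snippet quoted earlier (simplified sketch that ignores the FI_WAIT_FD distinction):

fastlock_acquire(&ep->lock);
if (tcpx_tx_pending(ep) && !ep->pollout_set) {
	/* a partial send is outstanding: also wait for writable */
	ep->pollout_set = true;
	events = OFI_EPOLL_IN | OFI_EPOLL_OUT;
} else if (!tcpx_tx_pending(ep) && ep->pollout_set) {
	/* nothing left to send: back to read-only interest */
	ep->pollout_set = false;
	events = OFI_EPOLL_IN;
}
fastlock_release(&ep->lock);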

Thanks for confirming. I'll prepare a patch for upstream.

shefty commented Feb 16, 2022

I have pull requests for main, v1.14.x, and v1.13.x now open.
