Switch from fine grained locking throughout the code base to device and domain level locking #743
base: master
Conversation
The sendrecv code includes a number of places that were referencing the device object solely to get the max_tag value. Simplify things and reduce cache usage by including a copy of max_tag in the endpoint structure. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
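A minimal sketch of the idea, using hypothetical struct and field names rather than the plugin's actual definitions: the endpoint caches its own copy of max_tag at creation time, so the send/recv fast path no longer dereferences the device object.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structures for illustration; not the aws-ofi-nccl definitions. */
struct example_device {
	uint64_t max_tag;   /* provider-derived limit on usable tag values */
	/* ... many other, mostly cold, fields ... */
};

struct example_ep {
	void *ofi_ep;       /* would be a struct fid_ep * in the real code */
	uint64_t max_tag;   /* copy of the device value, kept hot with the endpoint */
};

static void example_ep_init(struct example_ep *ep, const struct example_device *dev)
{
	ep->ofi_ep = NULL;  /* real code would create the Libfabric endpoint here */

	/* Copy max_tag once at endpoint creation so later send/recv calls
	 * avoid pulling the device structure into cache. */
	ep->max_tag = dev->max_tag;
}
```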
Switch from fine grained locking and forcing Libfabric to support FI_THREAD_SAFE to a large domain-level lock held across all non-trivial Libfabric calls. This enabled us to lower support to FI_THREAD_DOMAIN in a later patch. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
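A hedged sketch of the pattern this describes, with hypothetical structure and function names: one domain-level mutex is taken before every non-trivial Libfabric call and released afterwards, so the provider only needs FI_THREAD_DOMAIN semantics.

```c
#include <pthread.h>
#include <rdma/fi_tagged.h>

/* Hypothetical domain wrapper; the real plugin structures differ. */
struct example_domain {
	pthread_mutex_t lock;   /* one big lock covering all Libfabric calls on this domain */
	struct fid_ep *ep;
};

static ssize_t example_tsend(struct example_domain *domain, const void *buf,
			     size_t len, fi_addr_t dest, uint64_t tag, void *ctx)
{
	ssize_t rc;

	/* Hold the domain lock across the Libfabric call; all other calls
	 * into this domain take the same lock. */
	pthread_mutex_lock(&domain->lock);
	rc = fi_tsend(domain->ep, buf, len, NULL, dest, tag, ctx);
	pthread_mutex_unlock(&domain->lock);

	return rc;
}
```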
Now that the transports have a big lock around shared accesses in either the domain or device, there is no need for the fine grained locking we had been doing. Remove all the locks from the various utility interfaces. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Lower the required threading limits to FI_THREAD_DOMAIN, with control progress set to FI_PROGRESS_CONTROL_UNIFIED. On Libfabric providers that use the utility completion queues, this will result in no locks for both the send/recv and cq polling calls inside Libfabric, since we already have domain-level exclusivity in the transports. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
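For context, a minimal sketch of how such hints are typically requested through fi_getinfo; this is illustrative rather than the plugin's actual initialization code, and FI_PROGRESS_CONTROL_UNIFIED requires a recent Libfabric.

```c
#include <rdma/fabric.h>

/* Illustrative hint setup only; error handling is abbreviated. */
static struct fi_info *example_get_info(void)
{
	struct fi_info *hints, *info = NULL;

	hints = fi_allocinfo();
	if (hints == NULL)
		return NULL;

	/* Serialization is provided by the plugin's domain-level lock, so
	 * only FI_THREAD_DOMAIN is required from the provider, and control
	 * progress is unified with data progress. */
	hints->domain_attr->threading = FI_THREAD_DOMAIN;
	hints->domain_attr->control_progress = FI_PROGRESS_CONTROL_UNIFIED;

	if (fi_getinfo(fi_version(), NULL, NULL, 0, hints, &info) != 0)
		info = NULL;

	fi_freeinfo(hints);
	return info;
}
```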
Remove the functionality to create an endpoint per thread. Originally, this was used to create a full domain/endpoint/cq set of objects per thread that created a communicator, but that got messy as the rdma transport created multiple endpoints per thread based on the comm in use. We ended up creating a domain per thread (sometimes) and an endpoint per thread (always), which was convoluted but worked. With the switch from requesting FI_THREAD_SAFE to FI_THREAD_DOMAIN and the concurrent switch to domain-level locking for plugin operations, there really isn't much advantage to the endpoint-per-thread model, so this patch removes all that logic. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
34ddea2 to aea255c
The approach looks reasonable to me. When this PR is taken out of draft mode I may come up with some minor comments.
LGTM -- but all I'm really looking for here is that the big fat lock is taken at every entrypoint and held for the duration. I'm mostly excited that this finally kills the per-req mutex we had -- way overkill considering it was only there for the sake of protecting two counters that just as easily could have been atomics. Would really like to see this get in ASAP, ideally before any holiday downtime. Any specific reason you still have it as WIP? Can rebase most of my branches atop this and test with it if it's just a matter of gaining confidence with testing.
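For illustration, the kind of change being described here, with hypothetical field names: the two per-request counters can be C11 atomics instead of being guarded by a per-request mutex.

```c
#include <stdatomic.h>

/* Hypothetical request structure; the real plugin's fields differ. */
struct example_req {
	/* Previously guarded by a per-request pthread_mutex_t. */
	atomic_uint_fast32_t ncompls;   /* completions seen so far */
	atomic_uint_fast32_t pending;   /* operations still outstanding */
};

static void example_req_complete_one(struct example_req *req)
{
	/* Atomic updates replace lock/update/unlock on the completion path. */
	atomic_fetch_add_explicit(&req->ncompls, 1, memory_order_relaxed);
	atomic_fetch_sub_explicit(&req->pending, 1, memory_order_acq_rel);
}
```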
I halfheartedly worry that we're losing something here with the overall approach -- it's more natural for structures that expect to be shared across threads to take responsibility for their own consistency, and by doing so it's immediately clear at a glance whether something expects/tolerates usage across threads. I wouldn't ever block on it but it's worth mentioning.
@@ -98,7 +98,6 @@ static size_t eager_max_size = 0;
 /* List of comms undergoing deferred cleanup */
 static nccl_ofi_deque_t *s_comm_cleanup_list = NULL;
 static nccl_ofi_deque_t *r_comm_cleanup_list = NULL;
-static pthread_mutex_t comm_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER;
This is a bit confusing, but I think you'll still need this one? The cleanup list is at the process level, not the domain level.
Two threads could simultaneously close two different communicators, and they'll both try to add their communicators to these cleanup lists.
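A small sketch of the scenario the reviewer describes, using hypothetical stand-ins for the plugin's deque types: because the cleanup lists are process-global, concurrent communicator closes from different threads still need a process-level lock around list insertion, independent of the domain-level locking.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the plugin's deque types; not the real API. */
struct example_elem  { struct example_elem *next; };
struct example_deque { struct example_elem *head; };

static pthread_mutex_t comm_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Defer a communicator's cleanup by pushing it onto a process-global list. */
static void defer_comm_cleanup(struct example_deque *list, struct example_elem *elem)
{
	/* Two threads closing different communicators (possibly on different
	 * domains) can get here concurrently, so a process-level lock is
	 * still needed even with domain-level locking everywhere else. */
	pthread_mutex_lock(&comm_cleanup_list_lock);
	elem->next = list->head;
	list->head = elem;
	pthread_mutex_unlock(&comm_cleanup_list_lock);
}
```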
Switch from fine-grained locking in all the various data structures to device- and domain-level locks. This means we have slightly higher lock contention when running multiple accelerators per process without domain-per-thread (which we do use on Trainium). In return, the locking code is far easier to understand and we (literally) remove thousands of locks (possibly a sign we have way too many request structures). This probably isn't the right end state, but it is a major step forward and a good point to checkpoint.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.