Switch from fine grained locking throughout the code base to device and domain level locking #743

Draft · wants to merge 6 commits into master from feature/simplify-locking
Conversation

bwbarrett
Contributor

Switch from fine-grained locking in all the various data structures to device- and domain-level locks. This means slightly higher contention on the locks if we're in a multi-accelerator-per-process regime and not using domain-per-thread (which we do on Trainium). In return, the locking code is way easier to understand and we (literally) remove thousands of locks (it's possible we have way too many request structures). This probably isn't the right end state, but it is a major step forward and a good point to checkpoint.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwbarrett bwbarrett marked this pull request as ready for review December 6, 2024 20:32
@bwbarrett bwbarrett requested review from rauteric and a team as code owners December 6, 2024 20:32
@bwbarrett bwbarrett marked this pull request as draft December 9, 2024 22:20
The sendrecv code includes a number of places that were referencing
the device object solely to get the max_tag value.  Simplify things
and reduce cache usage by including a copy of max_tag in the endpoint
structure.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
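
For context, a minimal sketch of that kind of change, with hypothetical struct and field names rather than the plugin's real definitions:

```c
#include <stdint.h>

/* Illustrative sketch only: struct and field names are hypothetical,
 * not the plugin's actual definitions. */
struct example_device {
	uint64_t max_tag;               /* provider's maximum tag value */
	/* ... other device state ... */
};

struct example_endpoint {
	struct example_device *device;
	uint64_t max_tag;               /* cached copy of device->max_tag */
	/* ... other endpoint state ... */
};

static inline void example_endpoint_init(struct example_endpoint *ep,
                                         struct example_device *dev)
{
	ep->device = dev;
	/* Cache the value once so hot send/recv paths never chase the
	 * device pointer just to read max_tag. */
	ep->max_tag = dev->max_tag;
}
```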
Switch from fine-grained locking and forcing Libfabric to support
FI_THREAD_SAFE to a coarse domain-level lock held across all non-trivial
Libfabric calls.  This enables us to lower the threading requirement to
FI_THREAD_DOMAIN in a later patch.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
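
A minimal sketch of the resulting pattern, assuming hypothetical domain types (only fi_send and the pthread calls are real APIs):

```c
#include <sys/types.h>
#include <pthread.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Illustrative only: one mutex per domain, held across every non-trivial
 * Libfabric call instead of many fine-grained locks. */
struct example_domain {
	pthread_mutex_t lock;           /* serializes fi_* calls on this domain */
	struct fid_ep *ep;
};

static ssize_t example_send(struct example_domain *domain, const void *buf,
                            size_t len, fi_addr_t dest_addr, void *context)
{
	ssize_t rc;

	pthread_mutex_lock(&domain->lock);
	rc = fi_send(domain->ep, buf, len, NULL, dest_addr, context);
	pthread_mutex_unlock(&domain->lock);

	return rc;
}
```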
Now that the transports have a big lock around shared accesses
at either the domain or device level, there is no need for the
fine-grained locking we had been doing.  Remove all the locks from
the various utility interfaces.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
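
A sketch of what that looks like for a utility structure; the names are illustrative, and the point is that thread safety becomes the caller's responsibility:

```c
#include <stddef.h>

/* Illustrative only: a utility structure that used to carry its own mutex
 * now assumes the caller already holds the owning device/domain lock. */
struct example_entry {
	struct example_entry *next;
};

struct example_freelist {
	struct example_entry *head;     /* no internal mutex anymore */
};

/* Caller must hold the owning domain's lock. */
static struct example_entry *example_freelist_get(struct example_freelist *fl)
{
	struct example_entry *entry = fl->head;

	if (entry != NULL)
		fl->head = entry->next;
	return entry;
}
```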
Lower the required threading level to FI_THREAD_DOMAIN, with control
progress set to FI_PROGRESS_CONTROL_UNIFIED.  On Libfabric providers
that use the utility completion queues, this results in no locks being
taken on either the send/recv or CQ polling paths inside Libfabric, since
we already have domain-level exclusivity in the transports.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
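
A minimal sketch of the corresponding Libfabric hints, assuming a Libfabric recent enough to define FI_PROGRESS_CONTROL_UNIFIED:

```c
#include <rdma/fabric.h>

/* Illustrative only: hints requesting the relaxed threading model
 * described above. */
static struct fi_info *example_alloc_hints(void)
{
	struct fi_info *hints = fi_allocinfo();

	if (hints == NULL)
		return NULL;

	/* The plugin provides domain-level exclusivity itself, so Libfabric
	 * does not need to take locks on the caller's behalf. */
	hints->domain_attr->threading = FI_THREAD_DOMAIN;
	hints->domain_attr->control_progress = FI_PROGRESS_CONTROL_UNIFIED;

	return hints;
}
```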
Remove the functionality to create an endpoint per thread.  Originally,
this was used to create a full domain/endpoint/cq set of objects per
thread that created a communicator, but that got messy as the rdma
transport created multiple endpoints per thread based on the comm
in use.  So we ended up in a place where we were creating a domain
per thread (sometimes) and an endpoint per thread (always), which was
awkward but worked.

With the switch from requesting FI_THREAD_SAFE to FI_THREAD_DOMAIN and
the concurrent switch to domain-level locking for plugin operations,
there really isn't much advantage to the endpoint per thread model,
so this patch removes all that logic.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett bwbarrett force-pushed the feature/simplify-locking branch from 34ddea2 to aea255c Compare December 11, 2024 21:21
@rauteric
Contributor

The approach looks reasonable to me. When this PR is taken out of draft mode I may come up with some minor comments.

@aws-nslick
Contributor

LGTM -- but all I'm really looking for here is that the big fat lock is taken at every entrypoint and held for the duration. I'm mostly excited that this finally kills the per-req mutex we had -- way overkill considering it was only there for the sake of protecting two counters that just as easily could have been atomics.
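
For illustration, the kind of counters being described, rewritten as C11 atomics (field names hypothetical):

```c
#include <stdatomic.h>

/* Illustrative only: per-request counters as plain C11 atomics rather
 * than fields guarded by a per-request mutex. */
struct example_request {
	atomic_int pending;             /* operations still outstanding */
	atomic_int ncompleted;          /* completions seen so far */
};

static void example_request_complete_one(struct example_request *req)
{
	/* Each counter updates atomically; no per-request mutex needed. */
	atomic_fetch_add(&req->ncompleted, 1);
	atomic_fetch_sub(&req->pending, 1);
}
```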

Would really like to see this get in ASAP, ideally before any holiday downtime. Any specific reason you still have it as WIP? Can rebase most of my branches atop this and test with it if it's just a matter of gaining confidence with testing.

@aws-nslick
Contributor

I halfheartedly worry that we're losing something here with the overall approach -- it's more natural for structures that expect to be shared across threads to take responsibility for their own consistency, and by doing so it's immediately clear at a glance whether something expects/tolerates usage across threads. I wouldn't ever block on it but it's worth mentioning.

@@ -98,7 +98,6 @@ static size_t eager_max_size = 0;
 /* List of comms undergoing deferred cleanup */
 static nccl_ofi_deque_t *s_comm_cleanup_list = NULL;
 static nccl_ofi_deque_t *r_comm_cleanup_list = NULL;
-static pthread_mutex_t comm_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER;
Contributor
This is a bit confusing, but I think you'll still need this one? The cleanup list is at the process level, not the domain level.

Two threads could simultaneously close two different communicators, and they'll both try to add their communicators to these cleanup lists.
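
A sketch of the path being described, with the deque API stubbed out (the real nccl_ofi_deque signatures may differ):

```c
#include <pthread.h>

/* Illustrative only: the deque type and insert call are stand-ins for the
 * plugin's nccl_ofi_deque API, not its real signatures. */
struct example_deque;
struct example_deque_elem;

extern pthread_mutex_t comm_cleanup_list_lock;
extern struct example_deque *s_comm_cleanup_list;

int example_deque_insert_back(struct example_deque *deque,
                              struct example_deque_elem *elem);

static int example_defer_comm_cleanup(struct example_deque_elem *elem)
{
	int rc;

	/* Two threads closing different communicators race on this
	 * process-wide list, so the process-level mutex is still needed. */
	pthread_mutex_lock(&comm_cleanup_list_lock);
	rc = example_deque_insert_back(s_comm_cleanup_list, elem);
	pthread_mutex_unlock(&comm_cleanup_list_lock);

	return rc;
}
```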
