Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

copy_to_host locks too much #8586

Open
abadams opened this issue Mar 6, 2025 · 2 comments
Open

copy_to_host locks too much #8586

abadams opened this issue Mar 6, 2025 · 2 comments
Labels
dev_meeting Topic to be discussed at the next dev meeting performance

Comments

@abadams
Copy link
Member

abadams commented Mar 6, 2025

Currently in device_interface.cpp we grab a big global lock in any of the device API calls. E.g. copy_to_device grabs a lock before it does anything to the halide_buffer_t. The device APIs themselves are responsible for being thread-safe, so this isn't to protect internal device API state - it's to protect the halide_buffer_t state.

I think the intent of this lock is to make device api calls on a buffer atomic with respect to other device api calls on that same buffer struct. E.g. if two threads are copying to device at the same time on the same buffer, you don't want to do two device mallocs and leak one of them due to a race. There may also be synchronization issues around setting dirty flags after doing the call that makes those flags correct, and having something on another thread happen in between.

However, no-op copies to device are also grabbing the global lock, so this pointlessly serializes all Halide pipelines in cases where all buffers are already all on device and not dirty on host. This has caused a performance issue in production.

We need to figure out a more performant way to make these device api operations atomic with respect to each other (if that is indeed the intent). We probably can't say "you're not allowed to modify a buffer_t on multiple threads at once, because more will go wrong than just this anyway", because at this point someone might be relying on these locks for correctness. We can probably move the locks inwards so that they only surround the code that actually modifies buffer state though. I think it's ok to check if the copy would be a no-op outside the lock, though we have to be very careful about this because it's double-checked locking. We need to ensure that the check outside the lock can't be confused by a partially-modified state the buffer_t is in on another thread inside the lock.

For copy_to_device the early-out check is buffer->device && !buffer->host_dirty(). So it touches two fields. Yay. I guess we need to consider all potential interleavings of reading these two fields and the three mutating operations done inside the lock (allocating, copying, setting host dirty to false), keeping in mind the compiler's freedom to move the set host dirty call within the locked section, and potential changes future maintainers make within the locked section (they will assume it's locked, so it's fine to have all sorts of weird intermediate states). In this case I think they all work out OK, but it sure is tricky to think about.

Another option would be to not have a global lock, and instead attach the lock to the buffer itself using a bit in the flags field as a cas-loop-based spinlock to protect buffer device state. Maybe that's more robust. In the case we have in practice there would be no contention on this lock.

Eager to hear thoughts from @zvookin in particular.

@abadams abadams added dev_meeting Topic to be discussed at the next dev meeting performance labels Mar 6, 2025
@slomp
Copy link
Contributor

slomp commented Mar 6, 2025

For context, using the example of halide_copy_to_device, this is the lock that's been alluded:

ScopedMutexLock lock(&device_copy_mutex);

and these are the most "critical" places where buf needs to be inspected or altered:
result = halide_device_malloc(user_context, buf, device_interface);

if (buf->device_dirty()) {

result = device_interface->impl->copy_to_device(user_context, buf);

buf->set_host_dirty(false);

@slomp
Copy link
Contributor

slomp commented Mar 6, 2025

One possible solution would be to have a small "associative-set" of mutexes (mutices?) and obtain the corresponding mutex by looking-up into the set using the buffer pointer as a key. This should at the very least reduce contention when the interface is called concurrently on different buffers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev_meeting Topic to be discussed at the next dev meeting performance
Projects
None yet
Development

No branches or pull requests

2 participants