copy_to_host locks too much #8586

abadams · 2025-03-06T19:19:28Z

Currently in device_interface.cpp we grab a big global lock in any of the device API calls. E.g. copy_to_device grabs a lock before it does anything to the halide_buffer_t. The device APIs themselves are responsible for being thread-safe, so this isn't to protect internal device API state - it's to protect the halide_buffer_t state.

I think the intent of this lock is to make device api calls on a buffer atomic with respect to other device api calls on that same buffer struct. E.g. if two threads are copying to device at the same time on the same buffer, you don't want to do two device mallocs and leak one of them due to a race. There may also be synchronization issues around setting dirty flags after doing the call that makes those flags correct, and having something on another thread happen in between.

However, no-op copies to device are also grabbing the global lock, so this pointlessly serializes all Halide pipelines in cases where all buffers are already all on device and not dirty on host. This has caused a performance issue in production.

We need to figure out a more performant way to make these device api operations atomic with respect to each other (if that is indeed the intent). We probably can't say "you're not allowed to modify a buffer_t on multiple threads at once, because more will go wrong than just this anyway", because at this point someone might be relying on these locks for correctness. We can probably move the locks inwards so that they only surround the code that actually modifies buffer state though. I think it's ok to check if the copy would be a no-op outside the lock, though we have to be very careful about this because it's double-checked locking. We need to ensure that the check outside the lock can't be confused by a partially-modified state the buffer_t is in on another thread inside the lock.

For copy_to_device the early-out check is buffer->device && !buffer->host_dirty(). So it touches two fields. Yay. I guess we need to consider all potential interleavings of reading these two fields and the three mutating operations done inside the lock (allocating, copying, setting host dirty to false), keeping in mind the compiler's freedom to move the set host dirty call within the locked section, and potential changes future maintainers make within the locked section (they will assume it's locked, so it's fine to have all sorts of weird intermediate states). In this case I think they all work out OK, but it sure is tricky to think about.

Another option would be to not have a global lock, and instead attach the lock to the buffer itself using a bit in the flags field as a cas-loop-based spinlock to protect buffer device state. Maybe that's more robust. In the case we have in practice there would be no contention on this lock.

Eager to hear thoughts from @zvookin in particular.

slomp · 2025-03-06T22:54:00Z

For context, using the example of halide_copy_to_device, this is the lock that's been alluded:

Halide/src/runtime/device_interface.cpp

Line 210 in 7356c6d

ScopedMutexLock lock(&device_copy_mutex);

and these are the most "critical" places where buf needs to be inspected or altered:

Halide/src/runtime/device_interface.cpp

Line 176 in 7356c6d

result = halide_device_malloc(user_context, buf, device_interface);

Halide/src/runtime/device_interface.cpp

Line 186 in 7356c6d

if (buf->device_dirty()) {

Halide/src/runtime/device_interface.cpp

Line 191 in 7356c6d

result = device_interface->impl->copy_to_device(user_context, buf);

Halide/src/runtime/device_interface.cpp

Line 193 in 7356c6d

buf->set_host_dirty(false);

slomp · 2025-03-06T22:57:22Z

One possible solution would be to have a small "associative-set" of mutexes (mutices?) and obtain the corresponding mutex by looking-up into the set using the buffer pointer as a key. This should at the very least reduce contention when the interface is called concurrently on different buffers.

abadams added dev_meeting Topic to be discussed at the next dev meeting performance labels Mar 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

copy_to_host locks too much #8586

copy_to_host locks too much #8586

abadams commented Mar 6, 2025

slomp commented Mar 6, 2025

slomp commented Mar 6, 2025

copy_to_host locks too much #8586

copy_to_host locks too much #8586

Comments

abadams commented Mar 6, 2025

slomp commented Mar 6, 2025

slomp commented Mar 6, 2025