Currently in device_interface.cpp we grab a big global lock in every device API call. E.g. copy_to_device grabs a lock before it does anything to the halide_buffer_t. The device APIs themselves are responsible for being thread-safe, so this isn't to protect internal device API state - it's to protect the halide_buffer_t state.
I think the intent of this lock is to make device API calls on a buffer atomic with respect to other device API calls on that same buffer struct. E.g. if two threads are copying to device at the same time on the same buffer, you don't want to do two device mallocs and leak one of them due to a race. There may also be synchronization issues around setting the dirty flags after doing the call that makes those flags correct, if something on another thread can sneak in between the two.
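For reference, the current shape is roughly the following (a paraphrased sketch, not the actual runtime code: std::mutex stands in for the runtime's own mutex, the function name is made up, and the device-interface calls are elided):

```cpp
#include "HalideRuntime.h"

#include <mutex>

static std::mutex device_api_mutex;  // stand-in for the global lock in device_interface.cpp

int copy_to_device_sketch(void *user_context, halide_buffer_t *buf,
                          const halide_device_interface_t *device_interface) {
    // The lock is taken before the buffer is even inspected, so concurrent
    // calls on completely unrelated buffers serialize here.
    std::lock_guard<std::mutex> lock(device_api_mutex);
    if (buf->device && !buf->host_dirty()) {
        return 0;  // no-op copy, but we already paid for the global lock
    }
    // ... allocate on device if needed, then do the actual copy ...
    buf->set_host_dirty(false);
    return 0;
}
```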
However, no-op copies to device are also grabbing the global lock, so this pointlessly serializes all Halide pipelines in cases where all buffers are already on device and not dirty on host. This has caused a performance issue in production.
We need to figure out a more performant way to make these device API operations atomic with respect to each other (if that is indeed the intent). We probably can't just say "you're not allowed to modify a buffer_t on multiple threads at once, because more will go wrong than just this anyway", because at this point someone might be relying on these locks for correctness. We can probably move the locks inwards so that they only surround the code that actually modifies buffer state, though. I think it's ok to check whether the copy would be a no-op outside the lock, but we have to be very careful about this because it's double-checked locking. We need to ensure that the check outside the lock can't be confused by a partially-modified state the buffer_t is in on another thread inside the lock.
For copy_to_device the early-out check is buffer->device && !buffer->host_dirty(). So it touches two fields. Yay. I guess we need to consider all potential interleavings of reading these two fields and the three mutating operations done inside the lock (allocating, copying, setting host_dirty to false), keeping in mind the compiler's freedom to reorder the set-host-dirty call within the locked section, and potential changes future maintainers make within the locked section (they will assume it's locked, so they'll feel free to create all sorts of weird intermediate states). In this case I think they all work out OK, but it sure is tricky to think about.
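For concreteness, a minimal sketch of the double-checked version, with the same caveats as above (std::mutex as a stand-in, the lock still global here, device calls elided, and no claim that the unlocked reads are safe against every interleaving - that is exactly the part that needs careful review):

```cpp
#include "HalideRuntime.h"

#include <mutex>

static std::mutex buffer_state_mutex;  // still a global lock in this sketch

int copy_to_device_double_checked(void *user_context, halide_buffer_t *buf,
                                  const halide_device_interface_t *device_interface) {
    // Unlocked early-out: already on device and host isn't dirty. These two
    // reads race with the writes inside the lock, so the ordering of those
    // writes has to guarantee this check never skips a needed copy.
    if (buf->device && !buf->host_dirty()) {
        return 0;
    }
    std::lock_guard<std::mutex> lock(buffer_state_mutex);
    // Re-check under the lock: another thread may have done the work already.
    if (buf->device && !buf->host_dirty()) {
        return 0;
    }
    if (!buf->device) {
        // ... allocate on device (device_malloc) ...
    }
    // ... do the actual copy to device ...
    buf->set_host_dirty(false);
    return 0;
}
```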
Another option would be to not have a global lock at all, and instead attach the lock to the buffer itself, using a bit in the flags field as a CAS-loop-based spinlock to protect the buffer's device state. Maybe that's more robust. In the case we hit in practice there would be no contention on this lock.
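A rough sketch of what that could look like, using GCC/Clang __atomic builtins. The lock bit chosen here is hypothetical (nothing in the runtime reserves it today), and a real version would also need every other write to flags, including set_host_dirty from user code, to become atomic:

```cpp
#include "HalideRuntime.h"

// Hypothetical lock bit: halide_buffer_flag_host_dirty is 1 and
// halide_buffer_flag_device_dirty is 2, so use the next bit up.
constexpr uint64_t buffer_lock_bit = 4;

void lock_buffer(halide_buffer_t *buf) {
    uint64_t expected = __atomic_load_n(&buf->flags, __ATOMIC_RELAXED);
    while (true) {
        expected &= ~buffer_lock_bit;  // we can only take the lock if the bit is clear
        uint64_t desired = expected | buffer_lock_bit;
        if (__atomic_compare_exchange_n(&buf->flags, &expected, desired,
                                        /*weak=*/true, __ATOMIC_ACQUIRE,
                                        __ATOMIC_RELAXED)) {
            return;  // we flipped the bit from 0 to 1; lock acquired
        }
        // CAS failed: 'expected' now holds the current flags; retry.
        // A real implementation would want a pause/yield or backoff here.
    }
}

void unlock_buffer(halide_buffer_t *buf) {
    // Clear only the lock bit; other flag bits may legitimately change
    // concurrently, so use an atomic and rather than a plain store.
    __atomic_fetch_and(&buf->flags, ~buffer_lock_bit, __ATOMIC_RELEASE);
}
```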
Eager to hear thoughts from @zvookin in particular.
One possible solution would be to have a small "associative set" of mutexes (mutices?) and obtain the corresponding mutex by looking the buffer pointer up in the set, using it as the key. This should at the very least reduce contention when the interface is called concurrently on different buffers.
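A minimal sketch of that idea, assuming a fixed pool of 64 mutexes and std::mutex in place of the runtime's own mutex; the pool size and the hash are arbitrary:

```cpp
#include <cstdint>
#include <mutex>

constexpr size_t mutex_pool_size = 64;  // hypothetical; power of two so we can mask
static std::mutex buffer_mutex_pool[mutex_pool_size];

std::mutex &mutex_for_buffer(const void *buf) {
    // Hash the buffer pointer down to a pool index. Shift off the low bits,
    // which are mostly zero due to allocation alignment.
    uintptr_t p = reinterpret_cast<uintptr_t>(buf);
    return buffer_mutex_pool[(p >> 6) & (mutex_pool_size - 1)];
}
```

Two different buffers can still map to the same mutex, but a collision only adds a little contention, never incorrectness, while unrelated pipelines mostly stop serializing on each other.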