Make evictions safe again (for TTG) #680

devreal · 2024-10-15T20:49:14Z

Description

@therault and I talked through an issue we have in TTG that does not appear to exist in other DSLs: objects in TTG have arbitrary lifetimes, so they can exist for some time and then disappear. We see this for task-local temporary data, but it could happen across multiple tasks and even across multiple devices. That data can be write-only on the device (e.g., scratch space).

When the data_t is destroyed, its host-side backing memory is released (that is the last time we see the data_t in TTG). We detach the host data copy (through parsec_data_copy_detach), but we cannot touch the device data copies because TTG does not own them.

Now comes the eviction. Since the device copy has not been written back to the host and is not a read-only copy (like temporaries in PTG, for example), the device tries to evict the data to the host. But the host data no longer exists, and we no longer care about that data anyway.

Describe the solution you'd like

We need some way of telling PaRSEC that a piece of data is discarded and that its lifetime has effectively ended. In the destructor of my user-level object I'd call parsec_data_discard(), which would mark all device copies as discarded: they are not to be evicted, only collected and returned to the zone allocator. We cannot return them in the calling thread since we may or may not be the manager thread for that particular device (if the data was shared across devices).

After this call, I can release the host memory because I know that PaRSEC will not try to transfer data back into it.
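To make the intended semantics concrete, here is a minimal single-threaded sketch in C. It is a toy model, not PaRSEC code: every `toy_` type and function is hypothetical and only mirrors the behavior proposed above (discard flags the device copies, and a later eviction reclaims them without a write-back). It deliberately ignores concurrency between the discarding thread and the device managers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEVICES 4

/* Hypothetical stand-in for a device-side data copy; not a PaRSEC type. */
typedef struct {
    bool in_use;     /* device memory is currently allocated for this copy */
    bool dirty;      /* device copy is newer than the host copy */
    bool discarded;  /* owner declared the data dead: never write it back */
} toy_device_copy_t;

/* Hypothetical stand-in for a data_t: one host copy plus per-device copies. */
typedef struct {
    void *host_ptr;  /* NULL once the host backing memory has been released */
    toy_device_copy_t copies[TOY_MAX_DEVICES];
} toy_data_t;

typedef enum {
    TOY_EVICT_NONE,       /* nothing resident on this device */
    TOY_EVICT_WRITEBACK,  /* transfer to the host, then reclaim device memory */
    TOY_EVICT_RECLAIM     /* reclaim device memory without any transfer */
} toy_evict_action_t;

/* Called from the owner's destructor, possibly not on a device manager
 * thread: only flag the copies and leave reclamation to each manager. */
static void toy_data_discard(toy_data_t *data)
{
    assert(data != NULL);
    for (size_t d = 0; d < TOY_MAX_DEVICES; ++d) {
        if (data->copies[d].in_use) {
            data->copies[d].discarded = true;
        }
    }
}

/* Called by the manager of device `dev` when it wants the memory back.
 * Returns what the manager has to do for this copy. */
static toy_evict_action_t toy_device_evict(toy_data_t *data, size_t dev)
{
    assert(data != NULL && dev < TOY_MAX_DEVICES);
    toy_device_copy_t *copy = &data->copies[dev];
    if (!copy->in_use) {
        return TOY_EVICT_NONE;
    }
    toy_evict_action_t action = TOY_EVICT_RECLAIM;
    if (copy->dirty && !copy->discarded) {
        assert(data->host_ptr != NULL);  /* a write-back needs a host target */
        action = TOY_EVICT_WRITEBACK;
    }
    copy->in_use = false;
    copy->dirty = false;
    return action;
}
```

In this model the owner calls `toy_data_discard()` first and only then releases the host buffer; a dirty copy that is evicted afterwards yields `TOY_EVICT_RECLAIM` instead of attempting a transfer into freed memory.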

Describe alternatives you've considered

@therault and I considered using the flow to mark data as abandoned but that will not work if the data exists on multiple devices because we only know that the data is abandoned once the last device has completed. We also only know about the end of life of objects in the completion callback, after the flows have been handled.

Additional context

One wrinkle in all of that is that the concurrent nature of device management may lead to race conditions where one thread marks the data as discarded (and subsequently frees the host pointer) while another thread tries to evict the data copy from its device. We could provide a callback to release the host pointer once every device copy has either been evicted or is known never to be evicted.
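One way such a callback could work is a countdown over the outstanding device copies, sketched below in C11. This is an assumption about a possible design, not an existing PaRSEC mechanism: the `toy_` names are hypothetical, the counter holds one reference per device copy plus one for the owner, and whichever thread drops the last reference runs the release callback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical callback type: releases the host buffer. */
typedef void (*toy_release_cb_t)(void *host_ptr);

/* Hypothetical tracker attached to a discarded data. */
typedef struct {
    atomic_int pending;        /* device copies not yet collected, +1 for the owner */
    void *host_ptr;            /* host buffer handed to the callback */
    toy_release_cb_t release;  /* runs exactly once, on the last put */
} toy_discard_tracker_t;

static void toy_tracker_init(toy_discard_tracker_t *t, int device_copies,
                             void *host_ptr, toy_release_cb_t release)
{
    assert(t != NULL && release != NULL && device_copies >= 0);
    atomic_init(&t->pending, device_copies + 1);
    t->host_ptr = host_ptr;
    t->release = release;
}

/* Drop one reference. The owner calls this when it discards the data; each
 * device manager calls it after collecting its copy. The thread that drops
 * the last reference runs the callback. */
static void toy_tracker_put(toy_discard_tracker_t *t)
{
    assert(t != NULL);
    if (atomic_fetch_sub(&t->pending, 1) == 1) {
        t->release(t->host_ptr);
    }
}

/* Example callback: a real one would free the host buffer. This one only
 * counts invocations so that the effect can be observed. */
static void toy_count_release(void *host_ptr)
{
    ++*(int *)host_ptr;
}
```

With this scheme the owner never frees the host pointer directly, so a manager that is in the middle of an eviction still holds a reference and the host buffer stays valid until that manager is done.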

Alternatively, we could have data that cannot be evicted (because we never provide a host copy). For task-local scratch space, that is perfectly fine. For temporaries with a longer lifetime this would be desirable, but it could lead to problems if used excessively: we would never be able to evict that data to the host, potentially causing a deadlock.

devreal added the enhancement New feature or request label Oct 15, 2024

devreal assigned bosilca and therault Oct 15, 2024

devreal commented Oct 15, 2024 • edited by bosilca
