Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide mechanism to discard data #695

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

devreal
Copy link
Contributor

@devreal devreal commented Nov 8, 2024

Add a function parsec_data_discard that releases the data such that the host copy remains intact but does not prevent destruction of the data once all device copies have been released. This keeps the host copy available for device copies to inspect and avoids potential race conditions in the release process. During an eviction, copies of data with a discarded host copy are not transfered but put directly into the lru.

Replaces some duplicated code with a call to parsec_device_release_gpu_copy.

@devreal devreal requested a review from a team as a code owner November 8, 2024 19:27
@devreal devreal force-pushed the parsec-data-discard branch 2 times, most recently from bbd1448 to 1dd2d54 Compare November 8, 2024 19:59
parsec/mca/device/transfer_gpu.c Outdated Show resolved Hide resolved
parsec/data.c Show resolved Hide resolved
parsec/data.h Outdated Show resolved Hide resolved
parsec/data.c Outdated
/* second, mark the host copy as discarded */
parsec_data_copy_t *cpu_copy = data->device_copies[0];
if (NULL != cpu_copy) {
cpu_copy->flags = PARSEC_DATA_FLAG_DISCARDED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three issues here:

  • the flag should be set with |=
  • you should only release the data_t is the cpu_copy is valid. You do a test against NULL, you should use it.
  • this function is not thread-safe. This could not be an issue here, but I see it used by the GPU code without protection either, opening the code to race-conditions while manipulating these discarded copies/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I fixed setting the flag, added a lock to protect the check for the change to the device copy, and only now the data_t is only released twice if there is a device copy.

@@ -2342,7 +2333,17 @@ parsec_device_kernel_epilog( parsec_device_gpu_module_t *gpu_device,

assert( 0 <= gpu_copy->readers );

if( gpu_task->pushout & (1 << i) ) {
if (cpu_copy->flags & PARSEC_DATA_FLAG_DISCARDED) {
PARSEC_DEBUG_VERBOSE(20, parsec_gpu_output_stream,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where the race-condition could happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not a big issue here since we're not modifying the flags and neither the data_t nor the host copy will be freed, since the device copy maintains its reference. The worst thing that can happen is that we miss the flag being set and the release is delayed until we try to evict.

parsec_atomic_lock( &gpu_copy->original->lock );
/* get the next item before altering the next pointer */
item = (parsec_list_item_t*)item->list_next; /* conversion needed for volatile */
if( 0 == gpu_copy->readers ) {
if (cpu_copy->flags & PARSEC_DATA_FLAG_DISCARDED) {
parsec_list_item_ring_chop((parsec_list_item_t*)gpu_copy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. The flag isn't changed and the device copy maintains a reference on the data_t so if we miss the update of the flags here we will evict into the host copy and then release everything.

@devreal devreal force-pushed the parsec-data-discard branch 2 times, most recently from dde7129 to 14934a5 Compare November 20, 2024 23:14
Add a function `parsec_data_discard` that releases the data
such that the host copy remains intact but does not prevent
destruction of the data once all device copies have been released.
This keeps the host copy available for device copies to inspect
and avoids potential race conditions in the release process.
During an eviction, copies of data with a discarded host copy
are not transfered but put directly into the lru.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Otherwise we cannot destroy empty or discarded data.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Also OR the flag instead of assigning it.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Discarded data may never be pushed back so don't warn about it
still being owned by the device.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Discarded data sit toward the end of the lru while the data
to be evicted is at the front. We walk both forward and backward
to collect the discarded data from the back, until we either meet the
pivot or we found enough data to evict. If we discarded data we don't
evict.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@devreal devreal force-pushed the parsec-data-discard branch from 397ca9b to 53b7721 Compare December 23, 2024 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants