
Initial implementation of np.unique(return_index=True) #1138

Open
wants to merge 3 commits into base: branch-24.03

Conversation

JosephGuman

PR motivating nv-legate/legate#942.
After speaking with @magnatelee, I also decided to create a special task for unzipping indices and values rather than modifying legate.core's Reduce implementation.
@rohany


copy-pr-bot bot commented May 14, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rohany rohany requested review from magnatelee and manopapad May 14, 2024 04:39
Member

@rohany rohany left a comment


In general, please add more comments to the code, especially in places where you have changed some logic in a non-trivial manner. Another good place for comments is when you introduce (reasonable) code duplication -- a small comment saying why this is the case is good for future readers.

@@ -4168,8 +4170,17 @@ def unique(self) -> ndarray:
Multiple GPUs, Multiple CPUs

"""
thunk = self._thunk.unique()
return ndarray(shape=thunk.shape, thunk=thunk)
thunk = self._thunk.unique(return_index)
Member


[nit] -- I wouldn't call this thunk anymore; when return_index=True it's not a single thunk, it's a tuple of thunks.


result = None
# Assuming legate core will always choose GPU variant
Member


[nit]: Add a comment explaining why the branches here create the thunk in different ways (i.e., different types).

Member


Extra nit -- comments should be complete sentences, so punctuation and periods please.

if return_index:
returned_indices = self.runtime.create_unbound_thunk(ty.int64)
if self.runtime.num_gpus > 0:
task.add_output(returned_indices.base)
Member


Again, please comment on this asymmetry between the CPU and GPU implementations.

)
if return_index:
task = self.context.create_auto_task(CuNumericOpCode.UNZIP)
task.add_input(result.base)
Member


This is an unclear way of writing the code: adding result.base as an input, overwriting result, and then adding result.base as an output.

src/cunumeric/set/unique_reduce_template.inl (outdated, resolved)
tests/integration/test_unique.py (resolved)
int64_t index;
};

// Surprisingly it seems as though thrust can't figure out this comparison
Member


You might not need this if you either:

  1. implement comparison overloads on ZippedIndex
  2. use a builtin type like std::pair instead, which has equality implemented already
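For what it's worth, here is a minimal standalone sketch of option 1 (the struct layout and names are assumptions for illustration, not the PR's actual code): an equality overload that compares only the value field lets thrust::unique collapse duplicates with its default predicate.

#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <cstdint>

// Assumed layout of the zipped element; the PR's struct may differ.
template <typename VAL>
struct ZippedIndex {
  VAL value;
  int64_t index;
};

// Equality on the value only, so duplicates collapse to the first occurrence
// (after a stable sort by value, that occurrence carries the smallest index).
template <typename VAL>
__host__ __device__ bool operator==(const ZippedIndex<VAL>& a, const ZippedIndex<VAL>& b)
{
  return a.value == b.value;
}

int main()
{
  thrust::device_vector<ZippedIndex<int>> zipped(4);
  zipped[0] = {1, 0};
  zipped[1] = {1, 2};
  zipped[2] = {3, 1};
  zipped[3] = {3, 5};
  // With no predicate argument, thrust::unique falls back to operator==.
  auto new_end = thrust::unique(zipped.begin(), zipped.end());
  return static_cast<int>(new_end - zipped.begin());  // 2 unique values remain
}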


if (return_index) {
indices.first.destroy();
other_index.first.destroy();
Member


Should you be releasing this earlier, above, like other_piece?

Author


I originally wanted to destroy the buffers here, largely for the sake of clean code, to minimize the number of special blocks for the return_index case. Are you saying we should free these buffers closer to where we release my_piece and other_piece, to potentially free up space before
my_piece.first = output.create_output_buffer<VAL, 1>(buf_size);
on line 182?

sizeof(int64_t) * my_piece.second,
cudaMemcpyDeviceToDevice,
stream));
merged_index.destroy();
Member


I do not think it is safe for you to do a cudaMemcpyAsync and then destroy merged_index before the cudaMemcpyAsync is guaranteed to be done.

Author


Updating here: it is presently safe to do this in Legate, as all operations on a given GPU use the same stream. Since this is a tree reduction, I think it's probably best to destroy without manually synchronizing, to avoid excess latency and data duplication across multiple iterations of the reduction.
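As a general illustration of that stream-ordering argument (a standalone CUDA sketch, not the PR's code and not Legate's buffer API): when the release of the source buffer is itself ordered on the same stream as the copy, no explicit synchronization is needed; a release that is not stream-ordered would require one.

#include <cuda_runtime.h>
#include <cstdint>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  constexpr size_t n = 1 << 20;
  int64_t* src = nullptr;
  int64_t* dst = nullptr;
  cudaMallocAsync(&src, n * sizeof(int64_t), stream);
  cudaMallocAsync(&dst, n * sizeof(int64_t), stream);
  cudaMemsetAsync(src, 0, n * sizeof(int64_t), stream);

  cudaMemcpyAsync(dst, src, n * sizeof(int64_t), cudaMemcpyDeviceToDevice, stream);

  // Safe without a sync: this free is enqueued on the same stream, so it is
  // ordered after the copy and the allocation cannot be reused any earlier.
  cudaFreeAsync(src, stream);

  // A release mechanism that is NOT ordered on the stream would instead need
  // a cudaStreamSynchronize(stream) between the copy and the release.
  cudaFreeAsync(dst, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}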

@@ -105,7 +105,8 @@ std::vector<StoreMapping> CuNumericMapper::store_mappings(
}
case CUNUMERIC_MATMUL:
case CUNUMERIC_MATVECMUL:
case CUNUMERIC_UNIQUE_REDUCE: {
case CUNUMERIC_UNIQUE_REDUCE:
case CUNUMERIC_UNZIP_INDICES: {
Member


Based on the implementation of unzip_indices below, you don't need any special mapping for it; it should fall through to the default case at the bottom here.


src/cunumeric/set/unzip_indices_template.inl (resolved)