Initial implementation of np.unique(return_index=True) #1138

JosephGuman · 2024-05-14T04:21:40Z

PR motivating nv-legate/legate#942
After speaking with @magnatelee, I also decided to create a dedicated task for unzipping indices and values rather than modify legate.core's Reduce implementation.
@rohany

… Joseph Guman <joeytg@stanford.edu>
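For readers unfamiliar with the zip/unzip idea mentioned above, here is a minimal NumPy-only sketch of the semantics this PR targets. It is not the cuNumeric implementation: the helper name `unique_with_index` is hypothetical, the reduction is done by a plain sort rather than a distributed tree reduction, and NaN handling is ignored.

```python
import numpy as np


def unique_with_index(values):
    """Reference sketch: unique values plus the index of each first occurrence."""
    values = np.asarray(values).ravel()
    # "Zip": pair every value with its original position.
    zipped = sorted(zip(values.tolist(), range(values.size)))
    # Reduce: after sorting by (value, index), the first pair of each run of
    # equal values carries the smallest index for that value.
    reduced = [pair for i, pair in enumerate(zipped)
               if i == 0 or zipped[i - 1][0] != pair[0]]
    # "Unzip": split the surviving pairs back into two separate arrays.
    uniq = np.array([v for v, _ in reduced], dtype=values.dtype)
    first_index = np.array([i for _, i in reduced], dtype=np.int64)
    return uniq, first_index


data = np.array([3, 1, 3, 2, 1, 2, 5])
uniq, first_index = unique_with_index(data)
expected_uniq, expected_index = np.unique(data, return_index=True)
assert np.array_equal(uniq, expected_uniq)
assert np.array_equal(first_index, expected_index)
```

The final assertions compare the sketch against `np.unique(data, return_index=True)`, which is the behavior the new code path is meant to reproduce.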

copy-pr-bot · 2024-05-14T04:21:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rohany

In general, please add more comments to the code, especially in places where you have changed some logic in a non-trivial manner. Another good place for comments is when you introduce (reasonable) code duplication -- a small comment saying why this is the case is good for future readers.

rohany · 2024-05-14T06:27:02Z

cunumeric/array.py

@@ -4168,8 +4170,17 @@ def unique(self) -> ndarray:
        Multiple GPUs, Multiple CPUs

        """
-        thunk = self._thunk.unique()
-        return ndarray(shape=thunk.shape, thunk=thunk)
+        thunk = self._thunk.unique(return_index)


[nit] -- I wouldn't call this thunk anymore: when return_index=True it is a tuple of thunks, not a single thunk.
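One way to act on this nit is sketched below. `FakeThunk`, `FakeArray`, and `wrap_unique_result` are hypothetical stand-ins invented for illustration; they are not cuNumeric classes, and the real change in array.py may look different.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class FakeThunk:
    """Hypothetical stand-in for a cuNumeric thunk; only carries a shape."""
    shape: Tuple[int, ...]


@dataclass
class FakeArray:
    """Hypothetical stand-in for cunumeric.ndarray."""
    shape: Tuple[int, ...]
    thunk: FakeThunk


def wrap_unique_result(result: Union[FakeThunk, Tuple[FakeThunk, FakeThunk]],
                       return_index: bool = False):
    """Wrap whatever the thunk-level unique() returned into array objects."""
    if return_index:
        # Here the thunk layer handed back a pair, so name the parts explicitly.
        values_thunk, indices_thunk = result
        return (FakeArray(values_thunk.shape, values_thunk),
                FakeArray(indices_thunk.shape, indices_thunk))
    # Without return_index there is a single thunk, as before.
    return FakeArray(result.shape, result)
```

Naming the variable `result` and unpacking it only in the `return_index` branch keeps the name honest about what it holds in each case.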

rohany · 2024-05-14T06:28:06Z

cunumeric/deferred.py


+        result = None
+        # Assuming legate core will always choose GPU variant


[nit]: Add a comment explaining why the branches here create the thunk in different ways (i.e. with different types).

Extra nit -- comments should be complete sentences, so punctuation and periods please.

rohany · 2024-05-14T06:29:46Z

cunumeric/deferred.py

+        if return_index:
+            returned_indices = self.runtime.create_unbound_thunk(ty.int64)
+            if self.runtime.num_gpus > 0:
+                task.add_output(returned_indices.base)


Again, please add a comment explaining this asymmetry between the CPU and GPU implementations.

rohany · 2024-05-14T06:31:35Z

cunumeric/deferred.py

+                )
+            if return_index:
+                task = self.context.create_auto_task(CuNumericOpCode.UNZIP)
+                task.add_input(result.base)


This is an unclear way of writing the code: it adds result.base as an input, overwrites result, and then adds result.base as an output.
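A possible restructuring is sketched below with a recording stub in place of a real Legate task. `RecordingTask`, `build_unzip_task`, and `make_store` are hypothetical names, and the assumption that the unzip step has exactly two outputs is mine rather than something visible in the diff.

```python
class RecordingTask:
    """Hypothetical stand-in for a Legate task; it only records its stores."""

    def __init__(self, name):
        self.name = name
        self.inputs = []
        self.outputs = []

    def add_input(self, store):
        self.inputs.append(store)

    def add_output(self, store):
        self.outputs.append(store)


def build_unzip_task(zipped_store, make_store):
    """Wire up the unzip step without reusing one name for two stores."""
    task = RecordingTask("UNZIP")
    task.add_input(zipped_store)
    # Fresh names make it obvious these are different stores from the input.
    values_store = make_store("values")
    indices_store = make_store("indices")
    task.add_output(values_store)
    task.add_output(indices_store)
    return task, values_store, indices_store
```

Because the input and the outputs never share a variable, a reader can tell at a glance which store each add_input and add_output call refers to.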

src/cunumeric/set/unique_reduce_template.inl

tests/integration/test_unique.py

rohany · 2024-05-14T06:46:38Z

src/cunumeric/set/zip_indices.h

+  int64_t index;
+};
+
+// Surprisingly it seems as though thrust can't figure out this comparison


You might not need this if you either:

- implement comparison overloads on ZippedIndex, or
- use a standard library type like std::pair instead, which already has equality implemented.

rohany · 2024-05-14T06:49:19Z

src/cunumeric/set/unique.cu

+
+      if (return_index) {
+        indices.first.destroy();
+        other_index.first.destroy();


Should you be releasing these early, above, as is done for other_piece?

JosephGuman · 2024-05-21

I originally wanted to destroy the buffers here mainly for the sake of clean code, to minimize the number of special blocks for the return_index case. Are you saying we should free these buffers closer to where we release my_piece and other_piece, to potentially free up space before
my_piece.first = output.create_output_buffer<VAL, 1>(buf_size);
on line 182?

rohany · 2024-05-14T06:50:19Z

src/cunumeric/set/unique.cu

+                                   sizeof(int64_t) * my_piece.second,
+                                   cudaMemcpyDeviceToDevice,
+                                   stream));
+        merged_index.destroy();


I do not think it is safe for you to do a cudaMemcpyAsync and then destroy merged_index before the cudaMemcpyAsync is guaranteed to be done.

JosephGuman · 2024-05-25

Updating here: it is presently safe to do this in Legate, as all operations on a given GPU use the same stream. Since this is a tree reduction, I think it's probably best to destroy without manually synchronizing, both to avoid excess latency and to avoid duplicating data across multiple iterations of the reduction.

rohany · 2024-05-14T06:53:03Z

src/cunumeric/mapper.cc

@@ -105,7 +105,8 @@ std::vector<StoreMapping> CuNumericMapper::store_mappings(
    }
    case CUNUMERIC_MATMUL:
    case CUNUMERIC_MATVECMUL:
-    case CUNUMERIC_UNIQUE_REDUCE: {
+    case CUNUMERIC_UNIQUE_REDUCE:
+    case CUNUMERIC_UNZIP_INDICES: {


Based on the implementation of unzip_indices below, you don't need any special mapping for it; it should fall through to the default case at the bottom here.

rohany · 2024-05-15T17:39:54Z

cunumeric/deferred.py


+        result = None
+        # Assuming legate core will always choose GPU variant


Extra nit -- comments should be complete sentences, so punctuation and periods please.

src/cunumeric/set/unzip_indices_template.inl

tests/integration/test_unique.py

Initial implementation of np.unique(return_index=True) Signed-off-by:…

52e71aa

… Joseph Guman <joeytg@stanford.edu>

rohany requested review from magnatelee and manopapad May 14, 2024 04:39

rohany requested changes May 14, 2024

Joseph Thomas Guman and others added 2 commits May 14, 2024 16:39

Fixing some nits, expanding tests, and removing unnecessary thrust usage when unzipping indices

50a11de

[pre-commit.ci] auto fixes from pre-commit.com hooks

e7b0b3a

for more information, see https://pre-commit.ci

rohany reviewed May 15, 2024
