Forward-merge branch-23.02 to branch-23.04 #1214

Merged · 1 commit · Jan 31, 2023
14 changes: 7 additions & 7 deletions README.md
@@ -25,8 +25,8 @@ While not exhaustive, the following general categories help summarize the accele
| Category | Examples |
| --- | --- |
| **Data Formats** | sparse & dense, conversions, data generation |
| **Dense Operations** | linear algebra, matrix and vector operations, slicing, norms, factorization, least squares, svd & eigenvalue problems |
| **Sparse Operations** | linear algebra, eigenvalue problems, slicing, symmetrization, components & labeling |
| **Dense Operations** | linear algebra, matrix and vector operations, reductions, slicing, norms, factorization, least squares, svd & eigenvalue problems |
| **Sparse Operations** | linear algebra, eigenvalue problems, slicing, norms, reductions, factorization, symmetrization, components & labeling |
| **Spatial** | pairwise distances, nearest neighbors, neighborhood graph construction |
| **Basic Clustering** | spectral clustering, hierarchical clustering, k-means |
| **Solvers** | combinatorial optimization, iterative solvers |
@@ -65,17 +65,17 @@ auto matrix = raft::make_device_matrix<float>(handle, n_rows, n_cols);

### C++ Example

Most of the primitives in RAFT accept a `raft::handle_t` object for the management of resources which are expensive to create, such as CUDA streams, stream pools, and handles to other CUDA libraries like `cublas` and `cusolver`.
Most of the primitives in RAFT accept a `raft::device_resources` object for the management of resources which are expensive to create, such as CUDA streams, stream pools, and handles to other CUDA libraries like `cublas` and `cusolver`.
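
A minimal sketch of what that resource management looks like in practice; `get_stream()` and `sync_stream()` are the same accessors used in the comms example later in this PR:

```c++
#include <raft/core/device_resources.hpp>

raft::device_resources handle;

// The handle owns a main CUDA stream (plus optional stream pools and
// handles to libraries such as cublas/cusolver); work submitted through
// RAFT APIs is ordered on that stream.
cudaStream_t stream = handle.get_stream();

// Block until all work ordered on the handle's stream has completed.
handle.sync_stream();
```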

The example below demonstrates creating a RAFT handle and using it with `device_matrix` and `device_vector` to allocate memory, generate random clusters, and compute
pairwise Euclidean distances:
```c++
#include <raft/core/handle.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/random/make_blobs.cuh>
#include <raft/distance/distance.cuh>

raft::handle_t handle;
raft::device_resources handle;

int n_samples = 5000;
int n_features = 50;
Expand All @@ -93,12 +93,12 @@ raft::distance::pairwise_distance(handle, input.view(), input.view(), output.vie
It's also possible to create `raft::device_mdspan` views to invoke the same API with raw pointers and shape information:

```c++
#include <raft/core/handle.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/random/make_blobs.cuh>
#include <raft/distance/distance.cuh>

raft::handle_t handle;
raft::device_resources handle;

int n_samples = 5000;
int n_features = 50;
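
A minimal sketch of the view construction itself, assuming row-major `float` data already resident on the device and RAFT's `make_device_matrix_view` factory (the pointer and its allocation are placeholders for illustration):

```c++
#include <raft/core/device_mdspan.hpp>

// Placeholder: d_input is assumed to point to n_samples * n_features floats
// on the device, owned by some existing allocation.
float* d_input = nullptr;

// Non-owning matrix view over the raw pointer plus shape; the same
// pairwise_distance API shown above accepts this view in place of an
// owning device_matrix.
auto input_view =
  raft::make_device_matrix_view<const float, int>(d_input, n_samples, n_features);
```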
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -23,7 +23,7 @@ While not exhaustive, the following general categories help summarize the accele
* - Dense Operations
- linear algebra, matrix and vector operations, slicing, norms, factorization, least squares, svd & eigenvalue problems
* - Sparse Operations
- linear algebra, arithmetic, eigenvalue problems, slicing, symmetrization, components & labeling
- linear algebra, eigenvalue problems, slicing, norms, reductions, factorization, symmetrization, components & labeling
* - Spatial
- pairwise distances, nearest neighbors, neighborhood graph construction
* - Basic Clustering
Expand All @@ -45,6 +45,7 @@ While not exhaustive, the following general categories help summarize the accele
cpp_api.rst
pylibraft_api.rst
raft_dask_api.rst
using_comms.rst
contributing.md


97 changes: 97 additions & 0 deletions docs/source/using_comms.rst
@@ -0,0 +1,97 @@
Using RAFT Comms
================

RAFT provides a communications abstraction for writing distributed algorithms that can scale up to multiple GPUs and scale out to multiple nodes. The abstraction is largely based on MPI and NCCL, and it allows the user to decouple the design of algorithms from the environments where those algorithms are executed, enabling “write once, deploy everywhere” semantics. Currently, the distributed algorithms in both cuGraph and cuML are deployed in MPI and Dask clusters, while cuML’s distributed algorithms are also deployed on GPUs in Apache Spark clusters. This is a powerful concept: distributed algorithms are non-trivial to write, so maximizing reuse eases maintenance and lets bug fixes reach every deployment.

While users of RAFT’s communications layer largely get MPI integration for free just by installing MPI and using `mpirun` to run their applications, the `raft-dask` Python package provides a mechanism for executing algorithms written with RAFT’s communications layer in a Dask cluster. A small example shows how one would build an algorithm on top of this layer.

First, an instance of `raft::comms_t` is passed through the `raft::device_resources` instance and code is written to utilize collective and/or point-to-point communications as needed.

.. code-block:: cpp
:caption: Example function written with the RAFT comms API

#include <raft/core/comms.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/util/cudart_utils.hpp>

void test_allreduce(raft::device_resources const &handle, int root) {
  raft::comms::comms_t const& communicator = handle.get_comms();
  cudaStream_t stream = handle.get_stream();
  raft::device_scalar<int> temp_scalar(stream);

  int to_send = 1;
  raft::copy(temp_scalar.data(), &to_send, 1, stream);
  communicator.allreduce(temp_scalar.data(), temp_scalar.data(), 1,
                         raft::comms::op_t::SUM, stream);
  handle.sync_stream();
}

This exact function can now be executed in several different types of GPU clusters. For example, it can be executed with MPI by initializing an instance of `raft::comms::mpi_comms` with the `MPI_Comm`:

.. code-block:: cpp
:caption: Example of running test_allreduce() in MPI

#include <raft/core/mpi_comms.hpp>
#include <raft/core/device_resources.hpp>

raft::device_resources resource_handle;
// ...
// initialize MPI_Comm
// ...
raft::comms::initialize_mpi_comms(resource_handle, mpi_comm);
// ...
test_allreduce(resource_handle, 0);
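
The MPI setup elided above typically amounts to initializing MPI and selecting a GPU for each rank before constructing the handle. A minimal sketch follows; the one-GPU-per-rank device selection, the use of `MPI_COMM_WORLD`, and the overall program structure are assumptions for illustration, while the `initialize_mpi_comms` call follows the snippet above.

.. code-block:: cpp
:caption: Assumed sketch of the elided MPI initialization

#include <mpi.h>
#include <cuda_runtime.h>
#include <raft/core/mpi_comms.hpp>
#include <raft/core/device_resources.hpp>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Assumption: one GPU per MPI rank, selected by rank index.
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaSetDevice(rank);

  raft::device_resources resource_handle;
  raft::comms::initialize_mpi_comms(resource_handle, MPI_COMM_WORLD);

  test_allreduce(resource_handle, 0);

  MPI_Finalize();
  return 0;
}

The resulting binary is then launched across ranks with `mpirun`, e.g. `mpirun -np <num_gpus> ./allreduce_test` (the binary name here is assumed).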

Deploying our `test_allreduce` function in Dask requires a lightweight Python interface, which we can build with `pylibraft` by exposing the function through Cython:

.. code-block:: cython
:caption: Example of wrapping test_allreduce() w/ cython

from pylibraft.common.handle cimport device_resources
from cython.operator cimport dereference as deref

cdef extern from "allreduce_test.hpp":
    void test_allreduce(device_resources const &handle, int root) except +

def run_test_allreduce(handle, root):
    cdef const device_resources* h = <device_resources*><size_t>handle.getHandle()

    test_allreduce(deref(h), root)

Finally, we can use `raft_dask` to execute our new algorithm in a Dask cluster (please note this also uses `LocalCUDACluster` from the RAPIDS dask-cuda library):

.. code-block:: python
:caption: Example of running test_allreduce() in Dask

from raft_dask.common import Comms, local_handle
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

# Create and initialize Comms instance
comms = Comms(client=client)
comms.init()

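# Note: run_test_allreduce is assumed to be the Cython wrapper defined above,
# importable by each Dask worker process.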
def func_run_allreduce(sessionId, root):
    handle = local_handle(sessionId)
    run_test_allreduce(handle, root)

# Invoke run_test_allreduce on all workers
dfs = [
    client.submit(
        func_run_allreduce,
        comms.sessionId,
        0,
        pure=False,
        workers=[w]
    )
    for w in comms.worker_addresses
]

# Wait until processing is done
wait(dfs, timeout=5)

comms.destroy()
client.close()
cluster.close()
2 changes: 1 addition & 1 deletion python/raft-dask/setup.py
@@ -26,7 +26,7 @@
"numpy",
"numba>=0.49",
"joblib>=0.11",
"dask-cuda>=23.2*",
"dask-cuda==23.2.*",
"dask>=2022.12.0",
f"ucx-py{cuda_suffix}",
"distributed>=2022.12.0",