Forward-merge branch-23.02 to branch-23.04 #1214

Merged · 1 commit · Jan 31, 2023
14 changes: 7 additions & 7 deletions README.md
@@ -25,8 +25,8 @@ While not exhaustive, the following general categories help summarize the accele
| Category | Examples |
| --- | --- |
| **Data Formats** | sparse & dense, conversions, data generation |
| **Dense Operations** | linear algebra, matrix and vector operations, slicing, norms, factorization, least squares, svd & eigenvalue problems |
| **Sparse Operations** | linear algebra, eigenvalue problems, slicing, symmetrization, components & labeling |
| **Dense Operations** | linear algebra, matrix and vector operations, reductions, slicing, norms, factorization, least squares, svd & eigenvalue problems |
| **Sparse Operations** | linear algebra, eigenvalue problems, slicing, norms, reductions, factorization, symmetrization, components & labeling |
| **Spatial** | pairwise distances, nearest neighbors, neighborhood graph construction |
| **Basic Clustering** | spectral clustering, hierarchical clustering, k-means |
| **Solvers** | combinatorial optimization, iterative solvers |
@@ -65,17 +65,17 @@ auto matrix = raft::make_device_matrix<float>(handle, n_rows, n_cols);

### C++ Example

Most of the primitives in RAFT accept a `raft::handle_t` object for the management of resources which are expensive to create, such as CUDA streams, stream pools, and handles to other CUDA libraries like `cublas` and `cusolver`.
Most of the primitives in RAFT accept a `raft::device_resources` object for the management of resources which are expensive to create, such as CUDA streams, stream pools, and handles to other CUDA libraries like `cublas` and `cusolver`.
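
A minimal sketch of what that resource management looks like in practice; `get_stream()` and `sync_stream()` are the same accessors used in the comms example later in this PR:

```c++
#include <raft/core/device_resources.hpp>

raft::device_resources handle;

// The handle owns a main CUDA stream (plus optional stream pools and
// handles to libraries such as cublas/cusolver); work submitted through
// RAFT APIs is ordered on that stream.
cudaStream_t stream = handle.get_stream();

// Block until all work ordered on the handle's stream has completed.
handle.sync_stream();
```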

The example below demonstrates creating a RAFT handle and using it with `device_matrix` and `device_vector` to allocate memory, generate random clusters, and compute
pairwise Euclidean distances:
```c++
#include <raft/core/handle.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/random/make_blobs.cuh>
#include <raft/distance/distance.cuh>

raft::handle_t handle;
raft::device_resources handle;

int n_samples = 5000;
int n_features = 50;
Expand All @@ -93,12 +93,12 @@ raft::distance::pairwise_distance(handle, input.view(), input.view(), output.vie
It's also possible to create `raft::device_mdspan` views to invoke the same API with raw pointers and shape information:

```c++
#include <raft/core/handle.hpp>
#include <raft/core/device_resources.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/random/make_blobs.cuh>
#include <raft/distance/distance.cuh>

raft::handle_t handle;
raft::device_resources handle;

int n_samples = 5000;
int n_features = 50;
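
A minimal sketch of the view construction itself, assuming row-major `float` data already resident on the device and RAFT's `make_device_matrix_view` factory (the pointer and its allocation are placeholders for illustration):

```c++
#include <raft/core/device_mdspan.hpp>

// Placeholder: d_input is assumed to point to n_samples * n_features floats
// on the device, owned by some existing allocation.
float* d_input = nullptr;

// Non-owning matrix view over the raw pointer plus shape; the same
// pairwise_distance API shown above accepts this view in place of an
// owning device_matrix.
auto input_view =
  raft::make_device_matrix_view<const float, int>(d_input, n_samples, n_features);
```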
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -23,7 +23,7 @@ While not exhaustive, the following general categories help summarize the accele
* - Dense Operations
- linear algebra, matrix and vector operations, slicing, norms, factorization, least squares, svd & eigenvalue problems
* - Sparse Operations
- linear algebra, arithmetic, eigenvalue problems, slicing, symmetrization, components & labeling
- linear algebra, eigenvalue problems, slicing, norms, reductions, factorization, symmetrization, components & labeling
* - Spatial
- pairwise distances, nearest neighbors, neighborhood graph construction
* - Basic Clustering
Expand All @@ -45,6 +45,7 @@ While not exhaustive, the following general categories help summarize the accele
cpp_api.rst
pylibraft_api.rst
raft_dask_api.rst
using_comms.rst
contributing.md


97 changes: 97 additions & 0 deletions docs/source/using_comms.rst
@@ -0,0 +1,97 @@
Using RAFT Comms
================

RAFT provides a communications abstraction for writing distributed algorithms that can scale up to multiple GPUs and scale out to multiple nodes. The abstraction is largely based on MPI and NCCL, and it allows the user to decouple the design of algorithms from the environments where those algorithms are executed, enabling “write once, deploy everywhere” semantics. Currently, the distributed algorithms in both cuGraph and cuML are deployed in MPI and Dask clusters, while cuML’s distributed algorithms are also deployed on GPUs in Apache Spark clusters. This is a powerful concept: distributed algorithms are non-trivial to write, so maximizing reuse eases maintenance and lets bug fixes reach every deployment.

While users of RAFT’s communications layer largely get MPI integration for free just by installing MPI and using `mpirun` to run their applications, the `raft-dask` Python package provides a mechanism for executing algorithms written with RAFT’s communications layer in a Dask cluster. A small example shows how one would build an algorithm on top of this layer.

First, an instance of `raft::comms_t` is passed through the `raft::device_resources` instance and code is written to utilize collective and/or point-to-point communications as needed.

.. code-block:: cpp
:caption: Example function written with the RAFT comms API

#include <raft/core/comms.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/util/cudart_utils.hpp>

void test_allreduce(raft::device_resources const &handle, int root) {
  raft::comms::comms_t const& communicator = handle.get_comms();
  cudaStream_t stream = handle.get_stream();
  raft::device_scalar<int> temp_scalar(stream);

  int to_send = 1;
  raft::copy(temp_scalar.data(), &to_send, 1, stream);
  communicator.allreduce(temp_scalar.data(), temp_scalar.data(), 1,
                         raft::comms::op_t::SUM, stream);
  handle.sync_stream();
}

This exact function can now be executed in several different types of GPU clusters. For example, it can be executed with MPI by initializing an instance of `raft::comms::mpi_comms` with the `MPI_Comm`:

.. code-block:: cpp
:caption: Example of running test_allreduce() in MPI

#include <raft/core/mpi_comms.hpp>
#include <raft/core/device_resources.hpp>

raft::device_resources resource_handle;
// ...
// initialize MPI_Comm
// ...
raft::comms::initialize_mpi_comms(resource_handle, mpi_comm);
// ...
test_allreduce(resource_handle, 0);
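
The MPI setup elided above typically amounts to initializing MPI and selecting a GPU for each rank before constructing the handle. A minimal sketch follows; the one-GPU-per-rank device selection, the use of `MPI_COMM_WORLD`, and the overall program structure are assumptions for illustration, while the `initialize_mpi_comms` call follows the snippet above.

.. code-block:: cpp
:caption: Assumed sketch of the elided MPI initialization

#include <mpi.h>
#include <cuda_runtime.h>
#include <raft/core/mpi_comms.hpp>
#include <raft/core/device_resources.hpp>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Assumption: one GPU per MPI rank, selected by rank index.
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaSetDevice(rank);

  raft::device_resources resource_handle;
  raft::comms::initialize_mpi_comms(resource_handle, MPI_COMM_WORLD);

  test_allreduce(resource_handle, 0);

  MPI_Finalize();
  return 0;
}

The resulting binary is then launched across ranks with `mpirun`, e.g. `mpirun -np <num_gpus> ./allreduce_test` (the binary name here is assumed).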

Deploying our `test_allreduce` function in Dask requires a lightweight Python interface, which we can build with `pylibraft` by exposing the function through Cython:

.. code-block:: cython
:caption: Example of wrapping test_allreduce() w/ cython

from pylibraft.common.handle cimport device_resources
from cython.operator cimport dereference as deref

cdef extern from "allreduce_test.hpp":
    void test_allreduce(device_resources const &handle, int root) except +

def run_test_allreduce(handle, root):
    cdef const device_resources* h = <device_resources*><size_t>handle.getHandle()

    test_allreduce(deref(h), root)

Finally, we can use `raft_dask` to execute our new algorithm in a Dask cluster (please note this also uses `LocalCUDACluster` from the RAPIDS dask-cuda library):

.. code-block:: python
:caption: Example of running test_allreduce() in Dask

from raft_dask.common import Comms, local_handle
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

# Create and initialize Comms instance
comms = Comms(client=client)
comms.init()

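# Note: run_test_allreduce is assumed to be the Cython wrapper defined above,
# importable by each Dask worker process.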
def func_run_allreduce(sessionId, root):
    handle = local_handle(sessionId)
    run_test_allreduce(handle, root)

# Invoke run_test_allreduce on all workers
dfs = [
    client.submit(
        func_run_allreduce,
        comms.sessionId,
        0,
        pure=False,
        workers=[w]
    )
    for w in comms.worker_addresses
]

# Wait until processing is done
wait(dfs, timeout=5)

comms.destroy()
client.close()
cluster.close()
2 changes: 1 addition & 1 deletion python/raft-dask/setup.py
@@ -26,7 +26,7 @@
"numpy",
"numba>=0.49",
"joblib>=0.11",
"dask-cuda>=23.2*",
"dask-cuda==23.2.*",
"dask>=2022.12.0",
f"ucx-py{cuda_suffix}",
"distributed>=2022.12.0",